lisad / phaser

library for batch-oriented complex data integration pipelines
MIT License

Make file name accessible (in batch? in context?) in phases while running #152

Open lisad opened 3 weeks ago

lisad commented 3 weeks ago

In phaser-example, the Seattle data for bicycle counters separates each location into a different file, whereas the data format we decided to output from the pipeline has a column for location description.

Since the Seattle data puts the location name in the file name, one place we could get the location name from is the file name, for example "Burke Gilman Trail NE 70th Bicycle Pedestrian Counter 20240705.csv" or "Thomas St Overpass Bike Ped Counter 20240526.csv".

In fact, this illustrates another common pattern, which is to put the date of a data file in the name of the file.

For pipelines that need to move information out of the filename into a field, how should we give access to the source file name?
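Whichever way the file name reaches a step, the parsing itself is simple. Here is a minimal sketch, assuming the file names always end in a space followed by an eight-digit YYYYMMDD date, as the two examples above do. The helper name parse_source_filename is hypothetical and not part of phaser.

```python
import re
from datetime import datetime
from pathlib import Path

# Hypothetical helper, not part of phaser.
# Assumes names shaped like "<location text> <YYYYMMDD>.csv".
_NAME_PATTERN = re.compile(r"^(?P<location>.+?)\s+(?P<date>\d{8})$")


def parse_source_filename(filename):
    """Return (location, date) parsed from a source file name or path."""
    stem = Path(filename).stem  # drop the directory and the ".csv" extension
    match = _NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"Unexpected source file name: {filename!r}")
    file_date = datetime.strptime(match.group("date"), "%Y%m%d").date()
    return match.group("location"), file_date


location, file_date = parse_source_filename(
    "sources/Burke Gilman Trail NE 70th Bicycle Pedestrian Counter 20240705.csv"
)
print(location)   # Burke Gilman Trail NE 70th Bicycle Pedestrian Counter
print(file_date)  # 2024-07-05
```

Note that the parsed location still carries the "Bicycle Pedestrian Counter" suffix. Trimming it down to something like "Burke Gilman Trail NE 70th" would be up to the pipeline, since the two example files do not even use the same suffix.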

  1. It's possible to do this today by overriding the init_source method in the Pipeline to learn the name of 'source' and add it to the Context as a variable, then calling super().init_source and proceeding. Later on, a step can pull the location out of the source file name and add it as a column value. This is pretty complicated, but we could document it.

  2. Another approach would be to let the command-line invocation pass in a variable, so the person running the pipeline would type: python3 -m phaser run seattle output "sources/Burke Gilman Trail NE 70th Bicycle Pedestrian Counter 20240705.csv" --var location="Burke Gilman Trail NE 70th". Passing variables on the command line is a good idea anyway for all kinds of variables, so I'll create a separate ticket for it.

  3. The phaser library could provide the source file names by adding them to the context for all steps and phases to access:

    • For extra sources, this would happen in the pipeline in init_source. Currently the context saves source names and data, but the names are internal names like "temp_data", not external file names like "temps Seattle 20240606.csv", so the data structure currently has no room for the file name.
    • The main source would have to be handled differently.

I think 3 is a good idea, but it's not trivial and may involve some refactoring of loading sources.
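To make option 3 a bit more concrete, here is a rough sketch of a data structure that keeps the external file name alongside the internal source name. Everything in it (SourceRecord, SketchContext, add_source, source_file_name) is hypothetical and only illustrates the shape of the data. It is not phaser's actual Context class, and the real refactoring would likely look different.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SourceRecord:
    """Hypothetical record pairing an internal source name with its file."""
    internal_name: str                         # e.g. "temp_data"
    file_path: Path                            # e.g. sources/temps Seattle 20240606.csv
    data: list = field(default_factory=list)   # rows loaded from the file

    @property
    def file_name(self):
        # External file name without the directory part.
        return self.file_path.name


class SketchContext:
    """Stand-in for a context object; NOT phaser's real Context."""

    def __init__(self):
        self._sources = {}

    def add_source(self, internal_name, file_path, data=None):
        rows = data if data is not None else []
        self._sources[internal_name] = SourceRecord(
            internal_name, Path(file_path), rows
        )

    def source_file_name(self, internal_name):
        return self._sources[internal_name].file_name


context = SketchContext()
context.add_source("temp_data", "sources/temps Seattle 20240606.csv")
print(context.source_file_name("temp_data"))  # temps Seattle 20240606.csv
```

With something like this, a step could look up the file name by the internal source name it already knows. The main source would still need its own handling, as noted above.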

jeffkole commented 3 weeks ago

Update phaser-example to use this new feature