Closed · RasmusOrsoe closed this 5 months ago
@Aske-Rosted thank you for your detailed comments! I ended up refactoring `set_gcd`
to make it clearer what the function actually does. I think your questions were a great hint that this function was too compact and opaque, so I have expanded the docstring and added multiple comments.
It's a little hard to grasp all the details of where stuff has been moved, but I imagine it makes sense once you start using it.
Yes, it does look like a bit much, but it really isn't. You can see here how the new `DataConverter`, `GraphNeTWriter`, and `GraphNeTFileReader` combination looks in practice.
I'm happy to elaborate on any bits of this code @Aske-Rosted
This PR refactors `DataConverter` such that it generalizes to experiments other than IceCube, and allows for very easy implementation of new data backends, i.e. it closes #634.

Please note that the many CodeClimate warnings are associated with code that this PR is not introducing. They mainly concern the complexity of IceCube-related extractors, which we cannot realistically address.
This PR looks huge, but it's not.
The main changes are:
- Extractors have been moved to `graphnet.data.extractors.icecube`. In this new structure, future experiments should put extractors under their own folder, e.g. `graphnet.data.extractors.km3net`. The content of our existing extractors has not changed.
- `I3Extractor` now inherits from a generic base class for extractors called `Extractor` and can be found in `graphnet.data.extractors.extractor.py`.
- The part of `DataConverter` that opens files and applies extractors to them has been moved to the generic base class `graphnet.data.readers.GraphNeTFileReader`. An i3-specific file reader can be found in `graphnet.data.readers.I3Reader`. Future integrations of experiments should put their readers here, e.g. `KM3NeTROOTReader.py`.
- The part of `DataConverter` that saves the output of extractors has been moved to the generic base class `graphnet.data.writers.GraphNeTWriter`. SQLite- and Parquet-specific writers can be found in the same directory.
- Pre-configured `DataConverter`s have been added under `graphnet.data.pre_configured`, where I've added an `I3ToSQLiteConverter` and an `I3ToParquetConverter`. These are just pre-defined converters that are identical in terms of usage and functionality to our existing `ParquetDataConverter` and `SQLiteDataConverter`, e.g. the reader and writer modules are hardcoded.
- Deprecated versions of `I3ToSQLiteConverter` and `I3ToParquetConverter` exist under `graphnet.data.{parquet,sqlite}` and are named `ParquetDataConverter` and `SQLiteDataConverter`. These classes allow existing data conversion scripts to function, but will add a deprecation warning to the logs. Once we release GraphNeT 2.0, these two directories should be deleted.
- `I3GenericExtractor` has been removed, because it does not (and never really did) fit well into our data processing. It is also quite slow, and my impression is that it's largely unused. It would require quite a rewrite of the module to make it fit, and I am not too keen on that.

**Usage**

Integrating a data source from a new experiment requires implementing a new experiment-specific subclass of
`GraphNeTFileReader`, which is fairly straightforward, as it entails writing just a single abstract method that opens a file and applies extractors to it. Suppose the new reader is called `MyCustomReader`, and `MyExtractor` is a data-source-specific implementation of the generic base class `Extractor`. Then `MyCustomReader` uniquely defines the input to the corresponding `MyExtractor` and how the extractors are called: vectorized, serially, etc. The only limitation imposed by us concerns the output format of `MyCustomReader`
(see docstring).

Likewise, extending the new `DataConverter` to output files in a new data backend is easy. It requires only the implementation of a single abstract method.
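Sketched minimally, that looks like the following. The base class here is a stand-in I define for illustration, not an import from GraphNeT, and the argument names are assumptions; only the `_save_file` hook and the `_merge_dataframes` flag are taken from this PR:

```python
from abc import ABC, abstractmethod
from typing import Dict

import pandas as pd


class GraphNeTWriter(ABC):
    """Stand-in for graphnet.data.writers.GraphNeTWriter (illustration only)."""

    # If True, `_save_file` receives one merged DataFrame per extractor;
    # if False, it receives one DataFrame per event per extractor.
    _merge_dataframes = True

    @abstractmethod
    def _save_file(self, data: Dict[str, pd.DataFrame], output_file_path: str) -> None:
        """Serialize the extractor output in `data` to `output_file_path`."""


class MyBackendWriter(GraphNeTWriter):
    """Hypothetical writer for a new data backend."""

    def _save_file(self, data: Dict[str, pd.DataFrame], output_file_path: str) -> None:
        # Backend-specific serialization would go here.
        ...
```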
`_save_file` receives a dictionary in one of two possible forms, controlled by the class variable `_merge_dataframes`. If `_merge_dataframes = True`, `_save_file` receives a single `pd.DataFrame` object per extractor that needs to be saved to disk. If `_merge_dataframes = False`, `_save_file` receives a `pd.DataFrame` object per event per extractor. By using the pandas framework for this part, exporting the data to new formats is incredibly easy. E.g. a writer that saves each extractor output as separate `.csv`
files would be:
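A sketch of such a CSV writer, under assumptions: the base class below is a stand-in I define here rather than the actual GraphNeT import, and treating the output path as a directory with one file per extractor is my own naming choice for illustration:

```python
import os
from abc import ABC, abstractmethod
from typing import Dict

import pandas as pd


class GraphNeTWriter(ABC):
    """Stand-in for graphnet.data.writers.GraphNeTWriter (illustration only)."""

    _merge_dataframes = True

    @abstractmethod
    def _save_file(self, data: Dict[str, pd.DataFrame], output_file_path: str) -> None:
        ...


class CSVWriter(GraphNeTWriter):
    """Hypothetical writer storing each extractor output as its own .csv file."""

    _merge_dataframes = True  # one merged DataFrame per extractor

    def _save_file(self, data: Dict[str, pd.DataFrame], output_file_path: str) -> None:
        # `data` maps extractor name -> merged DataFrame for one input file.
        # Here the output path is treated as a directory (an assumption).
        os.makedirs(output_file_path, exist_ok=True)
        for extractor_name, df in data.items():
            df.to_csv(os.path.join(output_file_path, f"{extractor_name}.csv"), index=False)
```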
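For completeness, the reader-side integration described earlier could be sketched as follows. Everything here is hypothetical: the base classes are stand-ins defined for illustration, and `MyCustomReader`, `MyExtractor`, and the `__call__` entry points are assumptions rather than GraphNeT's actual signatures:

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

import pandas as pd


class Extractor(ABC):
    """Stand-in for graphnet.data.extractors.Extractor (illustration only)."""

    def __init__(self, name: str) -> None:
        self._name = name

    @abstractmethod
    def __call__(self, event: Any) -> Dict[str, Any]:
        """Map one raw event to a column -> value dictionary."""


class GraphNeTFileReader(ABC):
    """Stand-in for graphnet.data.readers.GraphNeTFileReader (illustration only)."""

    @abstractmethod
    def __call__(self, file_path: str, extractors: List[Extractor]) -> Dict[str, pd.DataFrame]:
        """Open `file_path` and apply every extractor to its events."""


class MyExtractor(Extractor):
    """Hypothetical extractor pulling two fields out of a raw event dict."""

    def __call__(self, event: Any) -> Dict[str, Any]:
        return {"energy": event["energy"], "zenith": event["zenith"]}


class MyCustomReader(GraphNeTFileReader):
    """Hypothetical reader; real file parsing is faked for illustration."""

    def __call__(self, file_path: str, extractors: List[Extractor]) -> Dict[str, pd.DataFrame]:
        events = self._open(file_path)  # experiment-specific I/O would go here
        # The reader decides how extractors are called; this one runs serially,
        # but a real reader is free to vectorize instead.
        return {
            extractor._name: pd.DataFrame([extractor(event) for event in events])
            for extractor in extractors
        }

    def _open(self, file_path: str) -> List[Dict[str, float]]:
        # Placeholder standing in for actual file parsing.
        return [{"energy": 1.0, "zenith": 0.5}, {"energy": 2.0, "zenith": 0.7}]
```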