Separate PiMP pre-processing from database loading in initialisedb command

joewandy commented 4 years ago

Not sure if it's already the case now, but it would be great if, inside the initialisedb script, we could cleanly separate (1) the code that pre-processes the PiMP data from (2) the code that loads the pre-processed data into the database.

Ideally, part (1) would be a single object or method that uses only pandas or plain Python. It would take as input the paths to the sample CSV file and the intensity and annotation JSON files from PiMP, and output a list of cleaned peaks alongside their high-confidence compound annotations. That way it could also be used in other workflows to clean imported PiMP data, e.g. in WebOmics.

Part (2) would then take the output of part (1) and populate the Django database with it. I guess that will be unique to this project and won't really be reused elsewhere. However, if the code is kept neatly in one place, it will be easier to optimise the database loading performance later on.
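The split described above could look roughly like the sketch below. Everything in it is a hypothetical illustration rather than the actual FlyMet or PiMP code: the names `CleanedPeak`, `preprocess_pimp`, `load_peaks` and `save_batch`, the `sample` CSV column, the JSON layouts, and the cleaning rule (keep annotations above a confidence threshold and drop peaks left with none) are all assumptions. The point is only that part (1) touches files and plain Python objects, while part (2) receives those objects and hands them to whatever performs the database writes.

```python
import csv
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class CleanedPeak:
    """Plain-Python record for one cleaned peak (hypothetical shape)."""
    peak_id: str
    intensities: Dict[str, float]  # sample name -> intensity
    compounds: List[str] = field(default_factory=list)  # high-confidence annotations


def preprocess_pimp(sample_csv: str, intensity_json: str, annotation_json: str,
                    min_confidence: float = 0.9) -> List[CleanedPeak]:
    """Part (1): file paths in, cleaned peaks out. No database imports here."""
    with open(sample_csv, newline="") as f:
        # Assumed: the sample CSV has a column called "sample".
        samples = [row["sample"] for row in csv.DictReader(f)]
    with open(intensity_json) as f:
        # Assumed layout: {peak_id: {sample_name: intensity}}
        intensities = json.load(f)
    with open(annotation_json) as f:
        # Assumed layout: {peak_id: [{"compound": str, "confidence": float}]}
        annotations = json.load(f)

    peaks = []
    for peak_id, by_sample in intensities.items():
        kept = [a["compound"] for a in annotations.get(peak_id, [])
                if a["confidence"] >= min_confidence]
        if not kept:
            continue  # placeholder rule: drop peaks with no confident annotation
        peaks.append(CleanedPeak(
            peak_id=peak_id,
            intensities={s: by_sample[s] for s in samples if s in by_sample},
            compounds=kept,
        ))
    return peaks


def load_peaks(peaks: Iterable[CleanedPeak],
               save_batch: Callable[[List[CleanedPeak]], None],
               batch_size: int = 500) -> int:
    """Part (2): pass cleaned peaks to `save_batch` in batches.

    `save_batch` stands in for the project-specific code that writes to the
    Django database; keeping it behind one callable leaves a single place to
    optimise loading later. Returns the number of peaks passed on.
    """
    batch: List[CleanedPeak] = []
    total = 0
    for peak in peaks:
        batch.append(peak)
        if len(batch) >= batch_size:
            save_batch(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        save_batch(batch)
        total += len(batch)
    return total
```

With this shape, another project such as WebOmics could call only `preprocess_pimp` and ignore the loader entirely, since nothing in part (1) needs a database.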

Q: @kmcluskey, can I use the PeakSelector and CompoundSelector to do only part (1), even without a database at all? From the code, it seems that this isn't entirely possible yet, but it might be with small changes.

joewandy commented 4 years ago

It would be good if the code that pulls the related ChEBI ids could be made standalone as well.

kmcluskey commented 4 years ago

I think that is possible, but I wasn't sure of the format. A Python dict? Or code?

joewandy commented 4 years ago

> I think that is possible but I wasn't sure of the format? A python dict? or code?

A Python dict would be fine, I think. The key is the initial ChEBI id, and the value is a list of related ChEBI ids.
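As a rough illustration of that shape, a minimal sketch follows. The ids are made-up placeholders, not real ChEBI entries or relationships, and `related_ids` is a hypothetical helper name that does not come from the FlyMet code.

```python
from typing import Dict, List

# Key: the initial ChEBI id. Value: a list of related ChEBI ids.
# Placeholder ids only; they do not describe real ChEBI relationships.
related_chebi: Dict[str, List[str]] = {
    "CHEBI:X1": ["CHEBI:X2", "CHEBI:X3"],
    "CHEBI:X4": [],  # an id with no related entries
}


def related_ids(chebi_id: str, mapping: Dict[str, List[str]]) -> List[str]:
    """Return a copy of the related ids, or an empty list for an unknown id."""
    return list(mapping.get(chebi_id, []))
```

Because it is a plain dict, it can be built, saved (e.g. as JSON) and looked up without any database.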

kmcluskey / FlyMet

Separate PiMP pre-processing from database loading in initialisedb command #59