jacobwindsor / pubchem-ranker

Ranks compounds by number of BioAssays or BioSystems in PubChem
MIT License

Allow for more file types & formatting #3

Open jacobwindsor opened 8 years ago

jacobwindsor commented 8 years ago

Many file types are commonly used for large sets of metabolites or other compounds: BioPAX, Octave, SciLab and XML are just a few. It would be great to support all of these formats.

Moreover, the formatting of the dataset can vary greatly, and the programme currently only accepts the CAS ID followed by the IUPAC name in brackets. A customisable regex string passed as a method parameter would be a much better way to harvest the data from the source file.
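A minimal sketch of what such a regex parameter could look like. The function name `harvest`, the default pattern and the named groups `cas` and `name` are assumptions made for illustration; they are not taken from the current code.

```python
import re

# Hypothetical default: CAS ID followed by the IUPAC name in brackets,
# e.g. "50-00-0 (formaldehyde)".
DEFAULT_PATTERN = r"(?P<cas>\d{2,7}-\d{2}-\d)\s*\((?P<name>.+)\)"


def harvest(lines, pattern=DEFAULT_PATTERN):
    """Return (cas, name) tuples for every line that matches the pattern.

    The pattern must define the named groups "cas" and "name".
    Lines that do not match are skipped.
    """
    regex = re.compile(pattern)
    results = []
    for line in lines:
        match = regex.search(line)
        if match:
            results.append((match.group("cas"), match.group("name").strip()))
    return results


# Default layout: CAS ID, then the name in brackets.
default_rows = harvest(["50-00-0 (formaldehyde)", "not a compound line"])

# Custom layout supplied by the user: name first, then the CAS ID.
custom_rows = harvest(
    ["ethanol; 64-17-5"],
    pattern=r"(?P<name>[^;]+);\s*(?P<cas>\d{2,7}-\d{2}-\d)",
)
```

Using named groups means the caller can reorder the fields freely, as long as both groups are present in the pattern.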

jacobwindsor commented 7 years ago

What are the most commonly used file formats for this kind of thing? Is it CSV?

DeniseSl22 commented 7 years ago

From asking around, I gather that tab-delimited files (as text files) and comma-delimited files (Excel) are what people use. I think it would be nice if users could click on the type of file they have, and the program would then run for that type of file specifically. That way the program is faster (I hope) and usable by biologists who do not have any programming experience. See the picture below for an example of my idea ;) The first "button" accepts CSV, the second accepts tab-delimited files, and the third explains how other data files should be changed in order to use the ranker program.
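On the parsing side, the file-type choice could map to a delimiter, roughly as sketched here. The names `DELIMITERS` and `read_rows` and the `"csv"`/`"tab"` keys are hypothetical; only Python's standard `csv` module is used.

```python
import csv
import io

# Hypothetical mapping from the file type the user picks to a delimiter.
DELIMITERS = {"csv": ",", "tab": "\t"}


def read_rows(handle, file_type):
    """Read all non-empty rows from an open text file of the chosen type."""
    if file_type not in DELIMITERS:
        raise ValueError(
            "Unsupported file type %r; expected one of %s"
            % (file_type, sorted(DELIMITERS))
        )
    reader = csv.reader(handle, delimiter=DELIMITERS[file_type])
    return [row for row in reader if row]


# In-memory examples standing in for uploaded files.
csv_rows = read_rows(io.StringIO("50-00-0,formaldehyde\n64-17-5,ethanol\n"), "csv")
tab_rows = read_rows(io.StringIO("50-00-0\tformaldehyde\n"), "tab")
```

An unsupported type raises a `ValueError`, which is where the third "button" (instructions for converting other files) could come in.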

image

jacobwindsor commented 7 years ago

Nice idea. Should be fairly simple to do

DeniseSl22 commented 7 years ago

Okay nice ;)

DeniseSl22 commented 7 years ago

I also read somewhere that ISATab (http://isa-tools.org/) is the most common file format for metabolomics data.

DeniseSl22 commented 7 years ago

http://regexr.com/ could help advanced users write their own regex. I'm going to ask (regular) biologists and chemists what they would like the RP to do, display, etc.

DeniseSl22 commented 7 years ago

Should we still use only one metabolites dataset, or can we (and do we want to) include other kinds of dataset (proteomics, (environmental) chemistry, toxicology)? And do we want to use another dataset to validate the RP?