HPOA: Uncomment headers?

matentzn commented 3 years ago

Would it be possible to uncomment the headers in hpoa, so rather than:

#description: HPO annotations for rare diseases [7801: OMIM; 47: DECIPHER; 3958 ORPHANET]
#date: 2020-08-11
#tracker: https://github.com/obophenotype/human-phenotype-ontology
#HPO-version: http://purl.obolibrary.org/obo/hp.obo/hp/releases/2020-08-11/hp.obo.owl
#DatabaseID DiseaseName Qualifier   HPO_ID  Reference   Evidence    Onset   Frequency   Sex Modifier    Aspect  Biocuration
OMIM:618850 Hypervalinemia or hyperleucine-isoleucinemia        HP:0010913  PMID:25653144   PCS     1/1         P   HPO:probinson[2020-07-23];HPO:probinson[2020-07-23]

having:

#description: HPO annotations for rare diseases [7801: OMIM; 47: DECIPHER; 3958 ORPHANET]
#date: 2020-08-11
#tracker: https://github.com/obophenotype/human-phenotype-ontology
#HPO-version: http://purl.obolibrary.org/obo/hp.obo/hp/releases/2020-08-11/hp.obo.owl
DatabaseID  DiseaseName Qualifier   HPO_ID  Reference   Evidence    Onset   Frequency   Sex Modifier    Aspect  Biocuration
OMIM:618850 Hypervalinemia or hyperleucine-isoleucinemia        HP:0010913  PMID:25653144   PCS     1/1         P   HPO:probinson[2020-07-23];HPO:probinson[2020-07-23]

This would make it considerably easier to process it automatically with normal data science toolkits like pandas:

import pandas as pd
df = pd.read_csv(hpoa, sep="\t", comment='#')

The downside would be that tools currently using the file might break. But right now, there is no principled way to get at the column names (imagine you add a column in the future!). Not sure about this, but thought I'd raise it.

pnrobinson commented 3 years ago

We actually just changed it from no # in the header line to having a '#' in the header line because of a request. There are multiple ways to ingest data like this, and so I am not sure I agree that the pandas way is normal. I am completely open to both options, do others have comments? @dosumis @drseb @cmungall @balhoff ?

matentzn commented 3 years ago

I am also happy to change my ways if you know any way of doing it in Python! Not trying to make stuff difficult :)

pnrobinson commented 3 years ago

Can't you use header=None in the above pandas command? Alternatively, the csv library in Python is nice. If you do have a header, then DictReader is great, but if not, you basically just need to figure out the index of the fields you want to parse. For phenotype.hpoa, the following works to skip the # lines and process the rest of the file (untested, from memory, but it should at least be close!):

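import csv

with open(fname, newline="") as f:
    # Drop every line that starts with '#' (metadata and the commented header).
    data_lines = (line for line in f if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t")
    for row in reader:
        print(row)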

Each row is now a Python list of the fields of one annotation.
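A minimal sketch of the DictReader route for the current format, assuming the column-name line is the one that starts with "#DatabaseID" (as in the sample at the top of this thread). The helper name hpoa_lines and the shortened three-column sample are made up for illustration and are not part of any HPO tooling.

```python
import csv
import io

def hpoa_lines(lines):
    """Yield the header line (leading '#' removed) and all data lines."""
    for line in lines:
        if line.startswith("#DatabaseID"):
            yield line[1:]   # uncomment the column-name line
        elif not line.startswith("#"):
            yield line       # data line; other '#' lines are metadata and are skipped

# Tiny in-memory stand-in for phenotype.hpoa (columns shortened for brevity).
sample = io.StringIO(
    "#date: 2020-08-11\n"
    "#DatabaseID\tDiseaseName\tHPO_ID\n"
    "OMIM:618850\tHypervalinemia or hyperleucine-isoleucinemia\tHP:0010913\n"
)

rows = list(csv.DictReader(hpoa_lines(sample), delimiter="\t"))
print(rows[0]["HPO_ID"])
```

With this, fields are looked up by name rather than by position, so a reordered column would not silently shift the values. It still depends on the first column keeping the name DatabaseID.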

matentzn commented 3 years ago

Ah yes, header=None is what I am doing now. I meant more a standard way to actually add the columns back on. Right now I have to hardcode the columns in my script like this:

hpoa_columns = ["DatabaseID", "DiseaseName", "Qualifier", 
                "HPO_ID", "Reference", "Evidence", "Onset", 
                "Frequency", "Sex", "Modifier", "Aspect", "Biocuration" ]

df = pd.read_csv(hpoa, header=None, sep="\t", comment='#')
df.columns = hpoa_columns

But what will happen if one day, you add a column? Or remove one? Or change the order?
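One way to avoid the hardcoded list, sketched under the assumption that the last '#' line before the first data row is always the column header (true for the sample above, but not something the format guarantees). The function name read_hpoa_preamble and the shortened sample are hypothetical.

```python
import io

def read_hpoa_preamble(lines):
    """Return (metadata, columns) parsed from the leading '#' lines."""
    comments = []
    for line in lines:
        if not line.startswith("#"):
            break                           # first data line: preamble is over
        comments.append(line[1:].rstrip("\r\n"))
    if not comments:
        raise ValueError("no '#' lines found; cannot recover column names")
    *meta_lines, header = comments          # assumption: last comment line is the header
    metadata = dict(m.split(": ", 1) for m in meta_lines if ": " in m)
    return metadata, header.split("\t")

# Tiny in-memory stand-in for phenotype.hpoa (columns shortened for brevity).
sample = io.StringIO(
    "#description: HPO annotations for rare diseases\n"
    "#date: 2020-08-11\n"
    "#DatabaseID\tDiseaseName\tHPO_ID\n"
    "OMIM:618850\tHypervalinemia or hyperleucine-isoleucinemia\tHP:0010913\n"
)

metadata, columns = read_hpoa_preamble(sample)
print(columns)
# The recovered names could then be passed to pandas, e.g.
# pd.read_csv(path, sep="\t", comment="#", header=None, names=columns)
```

If a column is added, removed, or reordered upstream, the names follow the file instead of the script. The heuristic breaks if a metadata line is ever placed after the header line.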

pnrobinson commented 3 years ago

Well, the format will definitely need to evolve, so don't let your guard down! :-0 Maybe it is time to write a Python library that will just figure things like this out? :-0 Otherwise, we might just notice that we can't really levitate...

matentzn commented 3 years ago

Yeah, not a big deal in any case; I already love it that at least we have metadata in that file.. In any case, we can close this if no one else thinks this is an issue :P

I will work on producing an SSSOM derivative of the hpoa file at some point, which will solve the issue from my perspective in any case! Thanks for the feedback!

callahantiff commented 3 years ago

In the meantime, I handle these things with unit tests or try-except statements 😊
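A rough sketch of what such a guard could look like: compare the column names found in the file against the ones the script was written for, and fail loudly on any drift. The names EXPECTED_COLUMNS and check_columns are invented for this example, and "NewColumn" is a made-up future column.

```python
EXPECTED_COLUMNS = [
    "DatabaseID", "DiseaseName", "Qualifier", "HPO_ID", "Reference", "Evidence",
    "Onset", "Frequency", "Sex", "Modifier", "Aspect", "Biocuration",
]

def check_columns(found, expected=EXPECTED_COLUMNS):
    """Raise ValueError describing any difference between found and expected columns."""
    found, expected = list(found), list(expected)
    if found == expected:
        return
    added = [c for c in found if c not in expected]
    removed = [c for c in expected if c not in found]
    if added or removed:
        raise ValueError(f"HPOA columns changed: added={added}, removed={removed}")
    raise ValueError(f"HPOA column order changed: {found}")

# Example: a hypothetical future file that gains a column.
try:
    check_columns(EXPECTED_COLUMNS + ["NewColumn"])
except ValueError as err:
    message = str(err)
    print(message)
```

The same function can be called from a unit test against a freshly downloaded file, so that a format change shows up as a failing test rather than as mislabeled data.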

pnrobinson commented 3 years ago

OK, I think we can close this for now.

obophenotype / human-phenotype-ontology

HPOA: Uncomment headers? #6208