monarch-initiative / monarch-ingest

Data ingest application for Monarch Initiative knowledge graph using Koza
https://monarchinitiative.org

GO annotations ingest #152

Closed RichardBruskiewich closed 2 years ago

RichardBruskiewich commented 2 years ago

GOA ingest is substantially complete, ready for stress testing on full datasets. Resolves Monarch Ingest issue https://github.com/monarch-initiative/monarch-ingest/issues/64.

RichardBruskiewich commented 2 years ago

Strange bug reported by @kevinschaper:

ERROR:koza.app:Validation error while processing: 
{
   'DB': 'WB',
   'DB_Object_ID': 'WBGene00000013', 
   'DB_Object_Symbol': 'abf-2', 
   'Qualifier': 'involved_in', 
   'GO_ID': 'GO:0050830', 
   'DB_Reference': 'WB_REF:WBPaper00045314|PMID:24882217',
   'Evidence_Code': 'IEP', 
   'With_or_From': '',
   'Aspect': 'P',
   'DB_Object_Name': '', 
   'DB_Object_Synonym': 'C50F2.10|C50F2.e', 
   'DB_Object_Type': 'gene',
   'Taxon': 'taxon:6239|taxon:46170', 
   'Date': '20140827', 
   'Assigned_By': 'WB', 
   'Annotation_Extension': '', 
   'Gene_Product_Form_ID': ''
}
Traceback (most recent call last):
...
  File "pydantic/dataclasses.py", line 99, in pydantic.dataclasses._generate_pydantic_post_init._pydantic_post_init
    # +=======+=======+=======+
pydantic.error_wrappers.ValidationError: 4 validation errors for Gene
in_taxon
  value is not a valid list (type=type_error.list)
in_taxon
  string does not match regex "^[a-zA-Z_]?[a-zA-Z_0-9-]*:([A-Za-z0-9_][A-Za-z0-9_.-]*[A-Za-z0-9./\(\)\-><_:;]*)?$" (type=value_error.str.regex; pattern=^[a-zA-Z_]?[a-zA-Z_0-9-]*:([A-Za-z0-9_][A-Za-z0-9_.-]*[A-Za-z0-9./\(\)\-><_:;]*)?$)
in_taxon
  string does not match regex "^(http|ftp)" (type=value_error.str.regex; pattern=^(http|ftp))
in_taxon
  instance of OrganismTaxon, tuple or dict expected (type=type_error.dataclass; class_name=OrganismTaxon)

The PubMed reference, PMID:24882217, describes an interaction between C. elegans and S. aureus, which is why the Taxon field holds a pipe-delimited list of two taxa ('taxon:6239|taxon:46170') rather than a single value.

This might not be a common occurrence, but it is perhaps one to support. I'll modify the code accordingly. DONE: Commit https://github.com/monarch-initiative/monarch-ingest/pull/152/commits/c658aae1174bc7c108f745a68a233edb70331260.
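For readers following along, here is a minimal sketch of one way to handle such a value. It is not the code from the commit above. The function name parse_gaf_taxa is hypothetical, and the choice of the NCBITaxon: prefix is an assumption about what the target model expects. It only illustrates turning the piped string into a list, which is what the "value is not a valid list" error is complaining about.

```python
def parse_gaf_taxa(taxon_field):
    """Split a GAF Taxon value such as 'taxon:6239|taxon:46170' into a list of CURIEs.

    Hypothetical helper for illustration; not taken from monarch-ingest.
    """
    taxa = []
    for part in taxon_field.split("|"):
        part = part.strip()
        if not part:
            # Skip empty segments (e.g. an empty field or a trailing pipe)
            continue
        # Assumption: the target model wants NCBITaxon-prefixed CURIEs
        taxa.append(part.replace("taxon:", "NCBITaxon:", 1))
    return taxa


# The value from the failing row above
in_taxon = parse_gaf_taxa("taxon:6239|taxon:46170")
print(in_taxon)
```

In GAF, when two taxa are given, the first is the organism encoding the gene product and the second is the other organism in the interaction. Whether both belong in in_taxon, or only the first, is a modelling decision this sketch does not settle.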

Note that along the way, a flaw was discovered in the unit tests: the tests only run the last row of data in the test_rows() fixture. I ended up commenting/uncommenting individual entries to run every test row separately, to ensure that the underlying test passed for each one. The test pattern still needs to be fixed, though.
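One possible shape for a fix, sketched with the standard library's unittest.subTest rather than the project's own fixture (whose code is not shown here). The rows and the taxon_count stand-in are hypothetical. The point is only that each row is exercised and reported on its own, so nothing has to be commented in and out by hand.

```python
import unittest

# Hypothetical rows standing in for the real test data; only Taxon matters here.
TEST_ROWS = [
    {"DB_Object_ID": "WBGene00000013", "Taxon": "taxon:6239|taxon:46170"},
    {"DB_Object_ID": "EXAMPLE:0000001", "Taxon": "taxon:6239"},
]


def taxon_count(row):
    """Stand-in for the transform under test: count taxa in a GAF Taxon field."""
    return len([t for t in row["Taxon"].split("|") if t])


class TestEveryRow(unittest.TestCase):
    def test_rows(self):
        for row in TEST_ROWS:
            # subTest reports each row separately and keeps going after a failure,
            # so an early bad row cannot hide behind a passing last row
            with self.subTest(db_object_id=row["DB_Object_ID"]):
                self.assertGreaterEqual(taxon_count(row), 1)
```

This can be run with python -m unittest. If the suite uses pytest, parametrizing the test over the rows with pytest.mark.parametrize would give the same one-result-per-row behaviour.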

kevinschaper commented 2 years ago

This is failing for me when loading the map. I'm going to see if I can tell what's going on in the file, maybe we'll need a Koza reader fix...

  File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/app.py", line 134, in process_maps
    self._load_map(map_file.config)
  File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/app.py", line 186, in _load_map
    for row in map_file:
  File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/model/source.py", line 89, in __next__
    row = self._get_row()
  File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/model/source.py", line 100, in _get_row
    row = next(self._reader)
  File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/io/reader/csv_reader.py", line 105, in __next__
    row = next(self.reader)
_csv.Error: field larger than field limit (131072)

kevinschaper commented 2 years ago

Wow, it's a real row!

Q7GXZ4  Q7GXZ4_HUMAN    4539    YP_003024034.1  817037172; 401785009; 397182479; 556700648; 998452523; 700277390; 909720034; 758742540; 290544619; 298368112; 381274282; 1104574650; 1025814800; 821345341; 302320253; 381234327; 71016875; 556700550; 519124320; 78775942; 984293254; 156457386; 984292232; 168251526; 381255033; 656330855; 700278258; 82493917; 555291730; 381230015; 823331545; 1104585121; 334089072; 397175409; 133854536; 388891788; 614530982; 359295003; 381224905; 606226844; 823113825; 381252583; 223013625; 151330154; 381247767; 557465501; 656332563; 269316136; 381240823; 156077784; 334089296; 528888717; 806640027; 807780566; 530847738; 151334208; 156078232; 545775138; 69062732; 401783595; 700281170; 381233571; 410062800; 332691307; 350283244; 1135519733; 223010503; 385254205; 381245415; 556701096; 305654982; 229503345; 381262130; 150022547; 1104583665; 375071948; 110809883; 1104580081; 1049056799; 451686285; 69065322; 513788832; 606226074; 519127806; 883740820; 332167394; 556704428; 381232381; 150023313; 270486512; 381229189; 694880413; 375067832; 310776908;....

Q7GXZ4 = NADH-ubiquinone oxidoreductase chain 4L

@kevinschaper, I also note that this file uses semicolons as the list separator (not pipe '|' characters), which also tends to do funny things during parsing if one is not aware of it.
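As a rough illustration of both points, the sketch below raises the standard library's csv field size limit before reading, then splits the oversized column on semicolons. It uses only the csv module and does not claim anything about how Koza configures its reader. The helper name raise_csv_field_limit is hypothetical, and the row is synthetic (generated numbers, not the real Q7GXZ4 identifiers).

```python
import csv
import io
import sys


def raise_csv_field_limit():
    """Raise the csv module's per-field size limit as far as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # Some platforms reject values that do not fit in a C long
            limit //= 2


raise_csv_field_limit()

# Synthetic stand-in for the oversized row: a tab-delimited line whose last
# column is a semicolon-separated list longer than the default 131072 limit.
synthetic_ids = "; ".join(str(100000000 + i) for i in range(20000))
line = "Q7GXZ4\tQ7GXZ4_HUMAN\t4539\tYP_003024034.1\t" + synthetic_ids + "\n"

reader = csv.reader(io.StringIO(line), delimiter="\t")
row = next(reader)

# Split the long column on semicolons (not pipes) and drop surrounding spaces
id_list = [value.strip() for value in row[4].split(";") if value.strip()]
print(len(row), len(id_list))
```

Without the call to raise_csv_field_limit(), reading this line fails with the same "field larger than field limit (131072)" error as in the traceback.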

RichardBruskiewich commented 2 years ago

Quoting @kevinschaper: "This is failing for me when loading the map. I'm going to see if I can tell what's going on in the file, maybe we'll need a Koza reader fix..." (followed by the same "_csv.Error: field larger than field limit (131072)" traceback shown above)

Hmm.. Weird. Let me know what you find out. Perhaps I did something silly somewhere...