Closed RichardBruskiewich closed 2 years ago
Strange bug reported by @kevinschaper:
ERROR:koza.app:Validation error while processing:
{
'DB': 'WB',
'DB_Object_ID': 'WBGene00000013',
'DB_Object_Symbol': 'abf-2',
'Qualifier': 'involved_in',
'GO_ID': 'GO:0050830',
'DB_Reference': 'WB_REF:WBPaper00045314|PMID:24882217',
'Evidence_Code': 'IEP',
'With_or_From': '',
'Aspect': 'P',
'DB_Object_Name': '',
'DB_Object_Synonym': 'C50F2.10|C50F2.e',
'DB_Object_Type': 'gene',
'Taxon': 'taxon:6239|taxon:46170',
'Date': '20140827',
'Assigned_By': 'WB',
'Annotation_Extension': '',
'Gene_Product_Form_ID': ''
}
Traceback (most recent call last):
...
File "pydantic/dataclasses.py", line 99, in pydantic.dataclasses._generate_pydantic_post_init._pydantic_post_init
# +=======+=======+=======+
pydantic.error_wrappers.ValidationError: 4 validation errors for Gene
in_taxon
value is not a valid list (type=type_error.list)
in_taxon
string does not match regex "^[a-zA-Z_]?[a-zA-Z_0-9-]*:([A-Za-z0-9_][A-Za-z0-9_.-]*[A-Za-z0-9./\(\)\-><_:;]*)?$" (type=value_error.str.regex; pattern=^[a-zA-Z_]?[a-zA-Z_0-9-]*:([A-Za-z0-9_][A-Za-z0-9_.-]*[A-Za-z0-9./\(\)\-><_:;]*)?$)
in_taxon
string does not match regex "^(http|ftp)" (type=value_error.str.regex; pattern=^(http|ftp))
in_taxon
instance of OrganismTaxon, tuple or dict expected (type=type_error.dataclass; class_name=OrganismTaxon)
The PubMed reference, PMID:24882217, describes an interaction between C. elegans and S. aureus, which is why the Taxon field holds a pipe-delimited list of two taxa rather than a single value.
This might not be a common occurrence, but it is probably one to support. I'll modify the code accordingly. DONE: Commit https://github.com/monarch-initiative/monarch-ingest/pull/152/commits/c658aae1174bc7c108f745a68a233edb70331260.
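For illustration, here is a minimal sketch of how a piped Taxon value could be split into a list before it is assigned to in_taxon. This is not necessarily what the linked commit does; the helper name parse_taxa and the rewrite of the "taxon:" prefix to "NCBITaxon:" are assumptions made for the example.

```python
from typing import List


def parse_taxa(taxon_field: str) -> List[str]:
    """Split a GAF 'Taxon' value into a list of taxon CURIEs.

    Hypothetical helper: assumes entries look like 'taxon:6239' and that
    the target model wants 'NCBITaxon:6239' style identifiers.
    """
    taxa = []
    for entry in taxon_field.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate empty segments such as a trailing pipe
        prefix, _, local_id = entry.partition(":")
        if prefix == "taxon" and local_id:
            taxa.append(f"NCBITaxon:{local_id}")
        else:
            taxa.append(entry)  # leave unexpected values untouched
    return taxa


# The value from the failing row above yields two entries.
print(parse_taxa("taxon:6239|taxon:46170"))
```

A single-taxon value such as 'taxon:6239' goes through the same path and comes back as a one-element list, so the field is always a list regardless of how many taxa the row names.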
Note that along the way, a flaw was discovered in the unit tests: they only run the last row of data in the test_rows() fixture. I ended up commenting and uncommenting individual entries to run each test row separately and confirm that the underlying test passed, but the test pattern needs to be fixed.
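One possible way to make every row run as its own test case, assuming the tests use pytest, is to parametrize over the rows instead of returning them all from a single fixture. The sketch below is not the project's actual test code: the ROWS list, the second (made-up) row, and the test name are placeholders.

```python
import pytest

# Placeholder rows: the first borrows two fields from the failing record above,
# the second is invented purely to give the example a single-taxon case.
ROWS = [
    {"DB_Object_ID": "WBGene00000013", "Taxon": "taxon:6239|taxon:46170"},
    {"DB_Object_ID": "EXAMPLE:single-taxon", "Taxon": "taxon:6239"},
]


# pytest generates one test case per entry in ROWS, so a failure in an
# early row can no longer be hidden by a later row.
@pytest.mark.parametrize("row", ROWS, ids=lambda r: r["DB_Object_ID"])
def test_taxon_entries_are_prefixed(row):
    entries = row["Taxon"].split("|")
    assert entries
    assert all(entry.startswith("taxon:") for entry in entries)
```

With this pattern each row shows up under its own ID in the pytest report, so there is no need to comment entries in and out by hand.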
This is failing for me when loading the map. I'm going to see if I can tell what's going on in the file, maybe we'll need a Koza reader fix...
File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/app.py", line 134, in process_maps
self._load_map(map_file.config)
File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/app.py", line 186, in _load_map
for row in map_file:
File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/model/source.py", line 89, in __next__
row = self._get_row()
File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/model/source.py", line 100, in _get_row
row = next(self._reader)
File "/Users/kschaper/Documents/Monarch/monarch-ingest/.venv/lib/python3.9/site-packages/koza/io/reader/csv_reader.py", line 105, in __next__
row = next(self.reader)
_csv.Error: field larger than field limit (131072)
Wow, it's a real row!
Q7GXZ4 Q7GXZ4_HUMAN 4539 YP_003024034.1 817037172; 401785009; 397182479; 556700648; 998452523; 700277390; 909720034; 758742540; 290544619; 298368112; 381274282; 1104574650; 1025814800; 821345341; 302320253; 381234327; 71016875; 556700550; 519124320; 78775942; 984293254; 156457386; 984292232; 168251526; 381255033; 656330855; 700278258; 82493917; 555291730; 381230015; 823331545; 1104585121; 334089072; 397175409; 133854536; 388891788; 614530982; 359295003; 381224905; 606226844; 823113825; 381252583; 223013625; 151330154; 381247767; 557465501; 656332563; 269316136; 381240823; 156077784; 334089296; 528888717; 806640027; 807780566; 530847738; 151334208; 156078232; 545775138; 69062732; 401783595; 700281170; 381233571; 410062800; 332691307; 350283244; 1135519733; 223010503; 385254205; 381245415; 556701096; 305654982; 229503345; 381262130; 150022547; 1104583665; 375071948; 110809883; 1104580081; 1049056799; 451686285; 69065322; 513788832; 606226074; 519127806; 883740820; 332167394; 556704428; 381232381; 150023313; 270486512; 381229189; 694880413; 375067832; 310776908;....
Q7GXZ4 = NADH-ubiquinone oxidoreductase chain 4L
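The 131072 in the error is the default per-field size limit of Python's csv module. As a rough sketch of one way around it (not necessarily the fix that belongs in the Koza reader), the limit can be raised with csv.field_size_limit before reading. The helper name raise_csv_field_limit and the synthetic row are made up for the example.

```python
import csv
import io
import sys


def raise_csv_field_limit() -> int:
    """Raise the csv field size limit as far as the platform allows.

    Hypothetical helper. On some platforms sys.maxsize does not fit in a
    C long and csv.field_size_limit raises OverflowError, so back off
    until a value is accepted.
    """
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            limit //= 10


# Synthetic tab-delimited row whose third column exceeds the default limit.
big_field = "; ".join(["817037172"] * 20000)
data = f"Q7GXZ4\tQ7GXZ4_HUMAN\t{big_field}\n"

raise_csv_field_limit()
rows = list(csv.reader(io.StringIO(data), delimiter="\t"))
print(len(rows), len(rows[0][2]))
```

Note that the limit is a module-wide setting, so raising it affects every csv reader in the process, not only the one loading this map file.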
@kevinschaper, I also note that semicolons (not pipe '|' characters) are used as the string list separator in this file, which can also trip up the parsing if one is not aware of it.
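As a small sketch of handling that separator explicitly, assuming the values are separated by a semicolon and optional whitespace as in the row shown above (the helper name split_id_list is made up):

```python
from typing import List


def split_id_list(value: str, sep: str = ";") -> List[str]:
    """Split a delimited identifier list, dropping blanks and padding."""
    return [item.strip() for item in value.split(sep) if item.strip()]


# First few identifiers from the long field in the row above.
print(split_id_list("817037172; 401785009; 397182479"))
```

Keeping the separator as a parameter lets the same helper serve pipe-delimited fields by passing sep="|".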
Hmm.. Weird. Let me know what you find out. Perhaps I did something silly somewhere...
GOA ingest is substantially complete, ready for stress testing on full datasets. Resolves Monarch Ingest issue https://github.com/monarch-initiative/monarch-ingest/issues/64.