jhu-bids / omop-vocab-on-fhir

OMOP2FHIR converter specifically for CodeSystems.
MIT License
0 stars 0 forks source link

`KeyError`: `concept_dict[row['concept_id_1']]` #3

Closed mgkahn closed 2 years ago

mgkahn commented 2 years ago

Saw your posting to Tiffany Callahan's OMOP2OBO site. I bask in the shadow of both Tiffany and Shahim.......

Based on Issue #1, I may simply be too early with kicking the tires on the code......

Screen Shot 2022-07-14 at 5 01 41 PM
mgkahn commented 2 years ago

This is a known issue already on the TODO list so removing.

joeflack4 commented 2 years ago

@mgkahn Thanks very much for opening this issue; I greatly appreciate the usage and feedback!

I couldn't actually find the TODO that you're referring to. Regardless, though, I do like to have all issues of significance to have a corresponding GitHub issue. I usually write a TODO instead of a GitHub issue when I'm short on time.

Possible causes

1. OMOP Version

At first, I was thinking that it could be an OMOP version issue (but now I think possible cause (2) or (3) is more likely). Currently, only OMOP version 5 is supported. I'm still a little new to OMOP, and I haven't yet examined anything prior to version 5. For now, I've made an issue regarding this. I also just added a CLI param for --omop-version, as well as some documentation regarding that, including the current restriction to v5.

2. Python is not reading concept_id fields using the same data type among different CSVs

The DtypeWarning that you're getting is something I still need to address. I have not yet assigned specific data types (e.g. integer, string) to each of the fields in each of the OMOP tables. It could be the case that my script is reading the concept IDs as integers in one CSV, but strings in the other. I've just pushed a commit that standardizes them to integers (though I still need to do this the proper way; using pandas), so if this was the cause, it should be fixed now.

3. Inconsistencies between CONCEPT.csv and CONCEPT_RELATIONSHIP.csv

The script will first parse CONCEPT.csv to build up a concept_dict dictionary, where all of the keys are concept_id. Then, it parses CONCEPT_RELATIONSHIP.csv, adding this information to concept_dict.

Can check and see if a concept with ID 40756823 in the concept_id field of your CONCEPT.csv? And if it is indeed missing, then you should contact the publisher of the OMOP set that you're working with letting them know about the issue, if you can. If it's not missing, then the cause was probably (2) as explained above.

For now, I've added exception handling, so you should be able to get past this error now. Instead, it'll show you this warning if the issue indeed persists:

Warning: Concept with id {row["concept_id_1"]} appeared in the concept_id_1 column of CONCEPT_RELATIONSHIP.csv, but was not found in CONCEPT.csv. This concept will thus be excluded.


Try running again and see if my fixes to (2) and (3) worked.

mgkahn commented 2 years ago

Joe: I did a fresh download of the vocabulary tables from Athena on the same day I executed the code.

For #1, you could do this check automatically in code by looking at vocabulary_id = None in the vocabulary table (select vocabulary_version from vocabulary where vocabulary_id = "None" and then decide if the rest of the code can execute.

For #2, I used the CSV files as provided from the Athena download zip. I'm running on a Mac but that should not be an issue

For #3, an issue with the Athena-generated files only. I did "grep 40756823 *.csv" -- no hits from CONCEPT.csv, but a match in concept_cpt4.csv. In a real-world setting, like our production OMOP, concept_cpt4.csv would be merged into concept.csv after pulling the CPT labels from UMLS (this is a step done outside of Athena because of the need for the UMLS license.

joeflack4 commented 2 years ago

Thanks for your responses!

Regarding (1), correct me if I'm wrong, but isn't VOCABULARY.CSV simply a list of various OMOP vocabularies, and vocabulary_version within that CSV represents the version of the vocabulary, not the version of OMOP, e.g. OMOP CDM v5.3.1 (I actually didn't realize these versions were that granular). When I look at that column in my CSV, I see things like version "US Census 2017 Release" for vocabulary "US Census", version "SOPT Version 9.2" for vocabulary "SOPT", etc.

This is actually one thing I observed about OMOP that I do not like. I'm not sure why they include this list of vocabularies that are not related to the vocabulary I have downloaded. Nor does it appear to be a comprehensive list.

As for identification of OMOP version, I have actually written that in already. Basically, the user passes the name of the vocabulary in the CLI. Then it looks up and tries to find a match in VOCABULARY.csv, and extracts the version. Then it includes that information in the resulting FHIR JSON, as well as the filename. If it doesn't find a match, it writes "unknown-version".

Regarding (2), this is a theoretical issue. It is theoretically possible that (2) could happen, but I'm not sure how likely it is. Everything in the releases (about ~12 or so vocabularies) was able to run fine without encountering this issue, but I have a fix for it now just in case.

Regarding (3), it seems that my suspicion was correct. I wonder why this was the case, though. This is in CPT4? I wonder if your CPT4 is different from mine. The CPT4 that I've included in releases (which I probably need to encrypt as well and allow decryption using UMLS key) was able to convert without error. I downloaded it as an OMOP v5 copy of CPT4 from the OHDSI Athena instance a few weeks ago.

I'm curious as to what your workflow looks like using CPT4. The one I downloaded from OHDSI Athena included labels; I didn't need to query the UMLS API or anything. However, I was not able to populate the tables until running a .jar file contained in the .zip downloaded from Athena and passing my UMLS key to it. Has your workflow been similar, or different somehow?

mgkahn commented 2 years ago

For #1 -- you want the version of the vocabulary tables, not the OMOP CDM. These usually go hand-in-hand but the "label" for the vocabulary version has the last updated time. For example, using the query I wrote above (or looking at the row in VOCABULARY.csv where vocabulary_id = 'None'), you find: "V5.0 22-JUN-22" So this is a V5 vocabulary (what you know you want) last updated on 22-JUN-22 ( may or may not be useful to you in later versions of the tool).

Screen Shot 2022-07-18 at 8 24 01 PM

Regarding workflow -- I was simply being too lazy -- I just selected all the check boxes from Athena and let it do its thing and what I got back was the following files. Until this exchange, I completely forgot about the CPT issue so I did NOT run the jar, which I should have done which would have merged the CPT codes into the concept.csv file. So the missing key was based on my laziness, not an error.

Screen Shot 2022-07-18 at 8 27 38 PM

All of the above notwithstanding, I'm not getting the tool to run, even with -vo 5 added. I don't know why line 354 in omop_vocab_on_fhir.py is failing:

Screen Shot 2022-07-18 at 8 30 00 PM
joeflack4 commented 2 years ago

Ohh man, my bad. It should run now, try again. Sorry I was editing in a hurry and didn't run it / don't have any tests.

I'll take a look at your other comments tomorrow.

joeflack4 commented 2 years ago

Ahhhh, regarding VOCABULARY.csv where vocabulary_id = 'None', I see what you mean now! That's great. I will utilize this. I'm wondering if it's possible that older versions (say, as old as v1) don't have this, so I think I'll leave the --omop-version CLI arg for now until I'm sure, and put a note about how it handles defaults in the --help.

I think both the OMOP CDM version and the vocabulary version itself are important, for provenance reasons if not for anything else.

I see now why the script failed initially due to not having run the CPT4 .jar. Should be fixed now!

@mgkahn The bug you experienced last also should be fixed!

Gonna close this issue now as I believe everything has been addressed. Feel free to open a new issue if/whenever you encounter any other issues!


edit: Remaining questions

  1. Do you know why the VOCABULARY.csv shows a bunch of random vocabularies? I'd expect it to be either (a) a comprehensive list, or (b) only the version of the vocabulary downloaded and the OMOP CDM version. Any idea why it appears to be (c) OMOP CDM version and seemingly random list of vocabularies?
mgkahn commented 2 years ago

"Random vocabularies" -- not sure to what you are referring but the full set of OHDSI vocabularies includes licensed vocabularies that requires extra steps (uploading your license to Athena) for these vocabularies to be included in an Athena download. The "plain vanilla" download only includes those vocabularies that are public domain and/or freely licensed by OHDSI for OHDSI use (only). So the vocabulary.csv table has entries that are not included in a "Joe-Public" download without additional work to get access to licensed vocabs. Does that answer what I think is your question?

joeflack4 commented 2 years ago

Hmm I see. I think I understand. I have only been downloading them one at a time, opting not to select several for download at once. I was confused as to why some vocabularies in addition to the 1 that I had checked off were included in this table, and further confused why the entries in the CSV did not include all of the vocabularies available for download. Now I see that this has to do with licensing. As you say, it will show a list of all the vocabularies on the server that are available for download and public domain / freely licensed as such. This really clears up some confusion I had; thanks!

Also I just pushed an update:

    - Documentation: Updated for `--help` text for several args.
    - Update: Arg `--omop-version` changed to `--omop-cdm-version`.
    - Update: Automated lookup of `--omop-cdm-version` in VOCABULARY.csv.
    - Update: File names now include `--omop-cdm-version`.
    - Update: Removed `--upload` for now. Was not yet implemented anyway.
    - Added: Additional exception handling for locating files.
    - Update: `--all-code-systems`: (i) Made 'done' column optional, (ii) made column names consistent w/ arg names, (iii) re-enabled 'extended' format, (iv) config.tsv changed to config.csv, documentation added, and support for edge case tab-separated added, (v) added `--help` text to documentation.

If you happened to be playing with this for the last couple hours, it would not have been stable. I've been pushing directly to main today. I'll avoid that and push to develop in the future.