monarch-initiative / SLDBGen

ValueError: Could not find id for gene CDR1 in Steckel 2012 #198

Closed hansenp closed 2 years ago

hansenp commented 2 years ago

I checked out SLDBGen and was able to successfully complete all setup commands in the README.md. There were no errors in the tests either.

But when I try to generate the list with the SL pairs, I get the following error message:

$ python parse_human_SLI.py
[INFO] Astsaturov et al 2010: 60 positive and 0 negative entries
[INFO] Baldwin et al 2010: 2 positive and 0 negative entries
[INFO] Blomen et al 2015: 133 positive and 0 negative entries
[INFO] Bommi et al 2008: 178 positive and 0 negative entries
[INFO] Brough et al 2018: 182 positive and 0 negative entries
[INFO] chakraborty et al 2017: 13 positive and 17 negative entries
[INFO] chin 2020: 3 positive and 0 negative entries
[INFO] Dai et al 2013: 8 positive and 0 negative entries
[INFO] Etemadmoghadam et al 2013: 24 positive and 0 negative entries
[INFO] Han et al 2017: 30 positive and 0 negative entries
[INFO] Josse et al 2014: 7 positive and 1424 negative entries
[INFO] Kang et al 2015: 8 positive and 0 negative entries
[INFO] Kessler et al 2012: 383 positive and 0 negative entries
[INFO] Kim et al 2011: 15 positive and 0 negative entries
[INFO] Krastev et al 2011: 3 positive and 0 negative entries
[INFO] Lord et al 2008: 9 positive and 124 negative entries
[INFO] Luo et al 2009: 81 positive and 0 negative entries
[INFO] Manually entered single-SLI studies (part zero): 51 positive and 6 negative entries
[INFO] Manually entered single-SLI studies (part one): 70 positive and 0 negative entries
[INFO] Manually entered (3): 6 positive and 0 negative entries
[INFO] Martin et al 2010/2011: 18 positive and 2 negative entries
[INFO] Mengwasser et al 2019: 5 positive and 0 negative entries
[INFO] Mohni et al 2014: 41 positive and 307 negative entries
[INFO] Mondal et al 2019: 6 positive and 4 negative entries
[INFO] Oser et al 2019: 103 positive and 0 negative entries
[INFO] Patidar 2020: 13 positive and 3 negative entries
[INFO] Schick et al 2019  n=3 SL interactions
[INFO] Schick et al 2019: 3 positive and 0 negative entries
[INFO] Shen et al 2015 : 7 positive and 104 negative entries
[INFO] Shen et al 2017: 168 positive and 10 negative entries
[INFO] Srivas et al 2016: 180 positive and 0 negative entries
Traceback (most recent call last):
  File "parse_human_SLI.py", line 146, in <module>
    steckel2012_list = steckel2012.parse()
  File "/Users/hansep/PycharmProjects/SLDBGen/idg2sl/parsers/steckel_2012_parser.py", line 75, in parse
    raise ValueError("Could not find id for gene %s in Steckel 2012" % geneB_sym)
ValueError: Could not find id for gene CDR1 in Steckel 2012

caufieldjh commented 2 years ago

I also got this error. Strange, because CDR1 appears to be right here: https://github.com/monarch-initiative/SLDBGen/blob/master/data/steckel-2012-KRAS.tsv#L311

caufieldjh commented 2 years ago

Ah, the real issue is probably that CDR1 isn't in the downloaded protein-coding_gene.txt. There's a similar problem in Sun 2019 with C7orf26 (see Sun-TableS9.txt). Removing the missing gene symbols from steckel-2012-KRAS.tsv and Sun-TableS9.txt allows the process to complete.
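To find every affected symbol before the parsers run, rather than hitting them one ValueError at a time, a pre-flight check along these lines might help. This is only a sketch and not SLDBGen code: `load_symbols`, `find_missing`, the `symbol` column name, and the `GENE_*` sample rows are all assumptions made for illustration.

```python
import csv
import io


def load_symbols(handle, symbol_column="symbol"):
    """Collect the gene symbols listed in a tab-separated, HGNC-style file.

    The column name "symbol" is an assumption about the file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[symbol_column] for row in reader if row.get(symbol_column)}


def find_missing(symbols_to_check, known_symbols):
    """Return the symbols absent from known_symbols, first-seen order, no repeats."""
    missing = []
    for sym in symbols_to_check:
        if sym not in known_symbols and sym not in missing:
            missing.append(sym)
    return missing


# Hypothetical stand-in for the downloaded protein-coding_gene.txt.
hgnc_text = "hgnc_id\tsymbol\nHGNC:0001\tGENE_A\nHGNC:0002\tGENE_B\n"
known = load_symbols(io.StringIO(hgnc_text))

# Hypothetical symbols as they might be read from a study's data file.
missing = find_missing(["GENE_A", "GENE_X", "GENE_B", "GENE_X"], known)
print(missing)
```

Running the same check over the symbols in each data file would list all symbols that the current gene file does not cover in one pass.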

pnrobinson commented 2 years ago

Strange, I am not getting this error. CDR1 is listed at HGNC: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:1798

pnrobinson commented 2 years ago

@caufieldjh I am now adding some new publications but would like to finalize this dataset next week. Is this error still happening on your side?

caufieldjh commented 2 years ago

I still see this issue on a fresh install. Running python parse_human_SLI.py gives me a virtually identical error to the one above:

$ python parse_human_SLI.py
[INFO] Will download  protein-coding_gene.txt
100% [..........................................................................] 9293239 / 9293239
[INFO] Astsaturov et al 2010: 60 positive and 0 negative entries
[INFO] Baldwin et al 2010: 2 positive and 0 negative entries
[INFO] Blomen et al 2015: 133 positive and 0 negative entries
[INFO] Bommi et al 2008: 178 positive and 0 negative entries
[INFO] Brough et al 2018: 182 positive and 0 negative entries
[INFO] chakraborty et al 2017: 13 positive and 17 negative entries
[INFO] chin 2020: 3 positive and 0 negative entries
[INFO] Dai et al 2013: 8 positive and 0 negative entries
[INFO] Etemadmoghadam et al 2013: 24 positive and 0 negative entries
[INFO] Han et al 2017: 30 positive and 0 negative entries
[INFO] Josse et al 2014: 7 positive and 1424 negative entries
[INFO] Kang et al 2015: 8 positive and 0 negative entries
[INFO] Kessler et al 2012: 383 positive and 0 negative entries
[INFO] Kim et al 2011: 15 positive and 0 negative entries
[INFO] Krastev et al 2011: 3 positive and 0 negative entries
[INFO] Lord et al 2008: 9 positive and 124 negative entries
[INFO] Luo et al 2009: 81 positive and 0 negative entries
[INFO] Manually entered single-SLI studies (part zero): 51 positive and 6 negative entries
[INFO] Manually entered single-SLI studies (part one): 70 positive and 0 negative entries
[INFO] Manually entered (3): 6 positive and 0 negative entries
[INFO] Martin et al 2010/2011: 18 positive and 2 negative entries
[INFO] Mengwasser et al 2019: 5 positive and 0 negative entries
[INFO] Mohni et al 2014: 41 positive and 307 negative entries
[INFO] Mondal et al 2019: 6 positive and 4 negative entries
[INFO] Oser et al 2019: 103 positive and 0 negative entries
[INFO] Patidar 2020: 13 positive and 3 negative entries
[INFO] Schick et al 2019  n=3 SL interactions
[INFO] Schick et al 2019: 3 positive and 0 negative entries
[INFO] Shen et al 2015 : 7 positive and 104 negative entries
[INFO] Shen et al 2017: 168 positive and 10 negative entries
[INFO] Srivas et al 2016: 180 positive and 0 negative entries
Traceback (most recent call last):
  File "parse_human_SLI.py", line 146, in <module>
    steckel2012_list = steckel2012.parse()
  File "/home/harry/SLDBGen/idg2sl/parsers/steckel_2012_parser.py", line 75, in parse
    raise ValueError("Could not find id for gene %s in Steckel 2012" % geneB_sym)
ValueError: Could not find id for gene CDR1 in Steckel 2012

I experience the same error on the develop branch.

pnrobinson commented 2 years ago

@caufieldjh I found out that the error occurs because the script downloads the protein-coding_gene.txt file from HGNC, whereas on my machine I was still using an old copy. In the meantime, two genes have changed symbols or definitions. I was able to reproduce the above error and fix it, so things should be OK now. Please close the issue if you can run the script without error.
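For anyone who hits this again after a future HGNC update: one possible way to tolerate renamed genes is to fall back to previous symbols when a current symbol is not found. The sketch below is not what was changed in SLDBGen. It assumes the gene file has `hgnc_id`, `symbol`, and a pipe-separated `prev_symbol` column, and the helper names and sample rows are hypothetical.

```python
import csv
import io


def build_symbol_index(handle):
    """Build lookups from current and previous symbols to gene ids.

    Assumes tab-separated input with "hgnc_id", "symbol", and a
    pipe-separated "prev_symbol" column (an assumption about the layout).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    current, previous = {}, {}
    for row in reader:
        gene_id = row["hgnc_id"]
        current[row["symbol"]] = gene_id
        for old in (row.get("prev_symbol") or "").split("|"):
            old = old.strip()
            if old:
                # Keep the first gene seen for a previous symbol.
                previous.setdefault(old, gene_id)
    return current, previous


def resolve(symbol, current, previous):
    """Return the id for a symbol, trying current symbols first; None if unknown."""
    if symbol in current:
        return current[symbol]
    return previous.get(symbol)


# Hypothetical rows, not real HGNC records.
sample = (
    "hgnc_id\tsymbol\tprev_symbol\n"
    "HGNC:0001\tNEW_A\tOLD_A|OLDER_A\n"
    "HGNC:0002\tGENE_B\t\n"
)
current, previous = build_symbol_index(io.StringIO(sample))
renamed_id = resolve("OLD_A", current, previous)
unknown_id = resolve("NOT_A_GENE", current, previous)
```

A parser could then log a warning and skip the row when `resolve` returns None instead of raising, although whether skipping is acceptable depends on how complete the final dataset needs to be.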

caufieldjh commented 2 years ago

Great, it works without issue now:

We got 12391 interactions including 2685 synthetic lethal interactions

(I don't have the permissions to close this issue, unfortunately)

pnrobinson commented 2 years ago

Thanks!