Strain-to-species mapping doesn't affect already added organisms

jseager7 commented 6 months ago

The strain-to-species mapping that PHI-Canto does for NCBI Taxonomy IDs doesn't seem to apply to organisms that have already been added to a curation session.

I was trying to remap the strain-level ID Aspergillus oryzae (strain ATCC 42149 / RIB 40) [510516] to the species-level ID for Aspergillus oryzae [5062]. I added the configuration to species_strain_map.yaml:

  5062:
    reference_strain: 5062
    other_strains:
      - 90341
      - 301529
      - 332754
      - 510516  # strain-level ID

But the strain-level organism was not updated in the curation session after restarting Canto:

The session database still has the strain-level taxonomy ID in it.

I can't find where in the code the strain-to-species remapping is meant to happen, but I'm guessing it takes place when the organism is first loaded or when the session is first created.

@kimrutherford Please can you point me in the right direction?

kimrutherford commented 6 months ago

I can't find where in the code the strain-to-species remapping is meant to happen, but I'm guessing it takes place when the organism is first loaded or when the session is first created.

Hi James. Yep, the mapping happens on loading. The code is here: https://github.com/pombase/canto/blob/c01491b8ab1800f91d3fce4e24dc43d180cb4d72/lib/Canto/UniProt/GeneLookup.pm#L75-L88

The get_species_taxon_of_strain_taxon() method uses the configuration from species_strain_map.yaml.
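As a rough illustration of that kind of lookup, here is a minimal Python sketch (not Canto's Perl code). It assumes the YAML has already been parsed into a dict with the same shape as the configuration above. The function names `build_strain_to_species` and `species_taxon_of_strain_taxon` are hypothetical, and returning the input unchanged for unmapped taxon IDs is an assumption rather than confirmed Canto behaviour.

```python
# Illustrative only: mirrors the structure of species_strain_map.yaml as a
# Python dict rather than parsing the real file.
SPECIES_STRAIN_MAP = {
    5062: {
        "reference_strain": 5062,
        "other_strains": [90341, 301529, 332754, 510516],
    },
}


def build_strain_to_species(config):
    """Invert the species -> strains config into a strain -> species lookup."""
    lookup = {}
    for species_taxon, entry in config.items():
        lookup[entry["reference_strain"]] = species_taxon
        for strain_taxon in entry.get("other_strains", []):
            lookup[strain_taxon] = species_taxon
    return lookup


def species_taxon_of_strain_taxon(lookup, taxon_id):
    """Return the species-level taxon ID (assumption: unmapped IDs pass through)."""
    return lookup.get(taxon_id, taxon_id)


lookup = build_strain_to_species(SPECIES_STRAIN_MAP)
print(species_taxon_of_strain_taxon(lookup, 510516))  # 5062
```

Because the lookup is only consulted when a gene is loaded, organisms already stored in a session keep whatever taxon ID they had at the time.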

The genes in the session databases have a reference to the organism that is set when the gene is added to the session: https://github.com/pombase/canto/blob/c01491b8ab1800f91d3fce4e24dc43d180cb4d72/etc/curs.sql#L3-L12

Possibly we need a script that updates genes to change the organism that they reference, based on the current configuration?

jseager7 commented 6 months ago

@kimrutherford Thanks for the information. A script would work fine for me, but I might need your help with that unless you can provide an example I can work from.

For a long-term solution, I'm thinking either:

I could update the strain-to-species mapping file to account for all the strains in the NCBI Taxonomy database (and hope that the file isn't enormous), or
Canto could automatically resynchronise its organisms with the strain-to-species mapping file on start-up, so only a restart of Canto would be needed. But I'm guessing that touching every session database on start-up might not be a very safe thing to do.

kimrutherford commented 6 months ago

Canto could automatically resynchronise its organisms with the strain-to-species mapping file on start-up, so only a restart of Canto would be needed. But I'm guessing that touching every session database on start-up might not be a very safe thing to do.

I think it would be safe but a bit slow. It might make sense to have a script that runs nightly to re-synchronise organisms.

A script would work fine for me, but I might need your help with that unless you can provide an example I can work from.

I'll have a dig around to find an example.

jseager7 commented 6 months ago

I'll have a dig around to find an example.

Thanks very much. I was thinking some more about the long-term solution and I think the most complete solution could be to make the strain-to-species mapping cover every strain-level rank in the NCBI Taxonomy database. Last I heard, the NCBI has no plans to mint any more strain-level identifiers, so it should be possible to create a complete mapping to prevent this from happening again.

kimrutherford commented 6 months ago

I think the most complete solution could be to make the strain-to-species mapping cover every strain-level rank in the NCBI Taxonomy database.

Do you have any idea how big that mapping would be? Maybe not too bad if you limited it to the organisms of interest to PHI-base?

I'll have a dig around to find an example.

Sorry that's taken so long. I'm having a look now, trying to refresh my memory about how strains work in Canto.

jseager7 commented 6 months ago

Do you have any idea how big that mapping would be? Maybe not too bad if you limited it to the organisms of interest to PHI-base?

I guess I'll find out when I try to create it 🙂

I think we tried to filter down to organisms of interest when we first created the mapping, but this issue shows that there are clearly some that we've still missed. I'd like to try to be as comprehensive as possible, as long as it doesn't slow down Canto too much when loading, but I could at least filter out viruses (which we don't curate).

kimrutherford commented 6 months ago

Hi James. It would help a lot if I had a test database/directory with (approximately) real PHI-Canto data. Do you have one you use for testing?

Due to a computer failure I don't have access to my old test PHI-Canto directory.

jseager7 commented 5 months ago

@kimrutherford Sorry about the delay on this, I was on annual leave last week. I'll work on getting a database ready to send to you, hopefully today but definitely by tomorrow.

jseager7 commented 5 months ago

I've sent a minimal copy of the database now.

kimrutherford commented 5 months ago

I've sent a minimal copy of the database now.

Thanks James, that's just what I needed. And I appreciate the README that you included too.

kimrutherford commented 5 months ago

Hi @jseager7

I've had a first go at this: reapply_species_strain_map.pl.

It works on the test database you sent me, but I plan to double-check the code and do more testing tomorrow after some sleep. I'm sure I've missed some edge cases. I don't advise you to run it on your production Canto yet. :-)

Having said that, it's probably in a state where you could run it on a test database. The script has no arguments so you should be able to run it with something like:

canto/script/canto_docker ./etc/reapply_species_strain_map.pl

You'll get output like:

updating Q2TWM0 in 5f40edb1ef06aecf
  taxon ID 510516 -> 5062
updating Q2U3V9 in 5f40edb1ef06aecf
  taxon ID 510516 -> 5062
updating Q2U3V9 in TrackDB
updating Q2TWM0 in TrackDB

It's a bit verbose at the moment. I'll tone it down once it's in a working state.

Let me know how it goes.

https://github.com/pombase/canto/blob/master/etc/reapply_species_strain_map.pl

jseager7 commented 5 months ago

@kimrutherford Thanks a lot for working on this. I've tested the script on a copy of the session from the production database and it remaps the taxon ID of the genes as expected.

The only problem is I'm left with the strain-level organism on the summary screen, even though the organism no longer exists:

The organism is missing from the Edit Genes page, so I can't remove it there:

kimrutherford commented 5 months ago

Hi James. Thanks for testing.

I forgot to garbage collect the unused organisms. I've fixed that now. Could you try again?

jseager7 commented 5 months ago

@kimrutherford I've tested the script again, but it still doesn't remove the unused organism. I've been looking at the database and I think this is because the script doesn't update the organism references in the strain and genotype tables (as far as I can see, it only touches the gene table).

Here's a visualisation of the references in the database:

I tested this by manually updating the organism ID in the strain and genotype tables in a local copy of the database before running the script, then ran the script, and the organism was deleted as expected.

So I think the missing steps are:

Update the organism_id column in the genotype table
Update the organism_id column in the strain table
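
To make the idea concrete, here is a minimal Python/SQLite sketch of the full remap, including removal of organisms that nothing references any more. It is not the actual script: the schema is a heavily simplified, hypothetical stand-in for the Curs tables (the column names `taxonid` and `organism_id` on every table are assumptions), `remap_organism` is a made-up helper, and the strain name is sample data.

```python
import sqlite3

# Hypothetical, simplified schema; the real Curs tables have more columns.
SCHEMA = """
CREATE TABLE organism (organism_id INTEGER PRIMARY KEY, taxonid INTEGER UNIQUE);
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, primary_identifier TEXT,
                   organism_id INTEGER REFERENCES organism(organism_id));
CREATE TABLE strain (strain_id INTEGER PRIMARY KEY, strain_name TEXT,
                     organism_id INTEGER REFERENCES organism(organism_id));
CREATE TABLE genotype (genotype_id INTEGER PRIMARY KEY,
                       organism_id INTEGER REFERENCES organism(organism_id));
"""

# Every table that holds a reference to an organism.
ORGANISM_REF_TABLES = ["gene", "strain", "genotype"]


def remap_organism(conn, strain_taxon, species_taxon):
    """Point all references at the species-level organism, then drop unused organisms."""
    row = conn.execute(
        "SELECT organism_id FROM organism WHERE taxonid = ?", (strain_taxon,)
    ).fetchone()
    if row is None:
        return  # the strain-level organism is not in this session
    old_id = row[0]
    # Create the species-level organism if the session does not have it yet.
    conn.execute("INSERT OR IGNORE INTO organism (taxonid) VALUES (?)", (species_taxon,))
    new_id = conn.execute(
        "SELECT organism_id FROM organism WHERE taxonid = ?", (species_taxon,)
    ).fetchone()[0]
    for table in ORGANISM_REF_TABLES:
        conn.execute(
            f"UPDATE {table} SET organism_id = ? WHERE organism_id = ?", (new_id, old_id)
        )
    # Garbage-collect organisms that no table references any more.
    referenced = set()
    for table in ORGANISM_REF_TABLES:
        for (organism_id,) in conn.execute(f"SELECT DISTINCT organism_id FROM {table}"):
            if organism_id is not None:
                referenced.add(organism_id)
    for (organism_id,) in conn.execute("SELECT organism_id FROM organism").fetchall():
        if organism_id not in referenced:
            conn.execute("DELETE FROM organism WHERE organism_id = ?", (organism_id,))


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO organism (organism_id, taxonid) VALUES (1, 510516)")
conn.execute("INSERT INTO gene (primary_identifier, organism_id) VALUES ('Q2TWM0', 1)")
conn.execute("INSERT INTO strain (strain_name, organism_id) VALUES ('sample strain', 1)")
conn.execute("INSERT INTO genotype (organism_id) VALUES (1)")

remap_organism(conn, 510516, 5062)
print(conn.execute("SELECT taxonid FROM organism").fetchall())  # [(5062,)]
```

If the strain or genotype updates are left out, the old organism stays in the `referenced` set and is never deleted, which matches what I saw on the summary screen.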

Sorry for adding more work; hopefully that change is similar to the existing code.

kimrutherford commented 5 months ago

Hi James. Thanks for the detail. Do you have a session database I could test with?

jseager7 commented 5 months ago

Yep, I've sent another session database.

kimrutherford commented 5 months ago

So I think the missing steps are:

I've changed the code to cover those cases and also added a step to remove duplicated rows in the strain table after the organisms are re-mapped.

Maybe it won't happen, but I was worried that after a row in the strain table was updated to the organism of the reference strain (confusing terminology, unfortunately), there might be a duplicate: two rows with the same organism_id and strain_name.
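
A minimal Python/SQLite sketch of that de-duplication idea, not the script itself: the two-table schema is a hypothetical simplification, the strain names are sample data, and `merge_duplicate_strains` is a made-up helper. It keeps the lowest `strain_id` for each (organism_id, strain_name) pair and repoints genotypes at the surviving row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE strain (strain_id INTEGER PRIMARY KEY, organism_id INTEGER, strain_name TEXT);
CREATE TABLE genotype (genotype_id INTEGER PRIMARY KEY, strain_id INTEGER);
-- After remapping, strains 1 and 2 share both organism and name.
INSERT INTO strain VALUES (1, 2, 'strain A'), (2, 2, 'strain A'), (3, 2, 'strain B');
INSERT INTO genotype VALUES (10, 1), (11, 2), (12, 3);
""")


def merge_duplicate_strains(conn):
    """Collapse duplicate strain rows; return a map of removed ID -> kept ID."""
    kept = {}        # (organism_id, strain_name) -> strain_id that survives
    update_map = {}  # duplicate strain_id -> surviving strain_id
    rows = conn.execute(
        "SELECT strain_id, organism_id, strain_name FROM strain ORDER BY strain_id"
    ).fetchall()
    for strain_id, organism_id, strain_name in rows:
        key = (organism_id, strain_name)
        if key in kept:
            update_map[strain_id] = kept[key]
        else:
            kept[key] = strain_id
    for old_id, new_id in update_map.items():
        # Repoint genotypes before deleting the duplicate strain row.
        conn.execute("UPDATE genotype SET strain_id = ? WHERE strain_id = ?", (new_id, old_id))
        conn.execute("DELETE FROM strain WHERE strain_id = ?", (old_id,))
    return update_map


update_map = merge_duplicate_strains(conn)
print(update_map)  # {2: 1}
```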

Could you try the updated script?

Thanks!

jseager7 commented 5 months ago

@kimrutherford I've tried the script and it deletes the unused organism as expected:

However, I get a ton of messages in the terminal about uninitialized string values:

Use of uninitialized value in concatenation (.) or string at /canto/etc/reapply_species_strain_map.pl line 65.
Use of uninitialized value in hash element at /canto/etc/reapply_species_strain_map.pl line 161.

I'm pretty sure this happens when the session database has rows in the strain table that reference strains in the Track database. The strain_name is null in this case since it reuses the name from the Track database. See the image below:

Unfortunately the test database didn't catch this case because it only uses custom strains, which only exist in the Curs database. If you need another session database that has Track strains in it, please let me know.

As far as I can tell, the script doesn't actually cause any problems to existing sessions due to these warnings.

With regards to this point:

Maybe it won't happen, but I was worried that after a row in the strain table was updated to the organism of the reference strain (confusing terminology, unfortunately), there might be a duplicate: two rows with the same organism_id and strain_name.

I think this is possible if two organisms have the same strain name in our strain list, though hopefully it's unlikely to happen in practice since we don't use strain-level taxonomy IDs in the strain table that we load into Canto. Still, good to have this check here.

kimrutherford commented 5 months ago

I'm pretty sure this happens when the session database has rows in the strain table that reference strains in the Track database.

I forgot about that situation!

Could you try this patch?

Thanks.

diff --git a/etc/reapply_species_strain_map.pl b/etc/reapply_species_strain_map.pl
index 7609ec7be..511327ca8 100755
--- a/etc/reapply_species_strain_map.pl
+++ b/etc/reapply_species_strain_map.pl
@@ -62,7 +62,8 @@ sub make_strain_key
 {
   my $strain = shift;

-  return $strain->organism_id() . '-' . $strain->strain_name();
+  return $strain->organism_id() . '-' .
+    ($strain->track_strain_id() // $strain->strain_name());
 }

 my $proc = sub {
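
For anyone less familiar with Perl, `//` is the defined-or operator: it falls back to the right-hand side only when the left-hand side is undefined. A rough Python analogue of the patched key function, as a sketch only (the `Strain` dataclass and the sample values are hypothetical stand-ins for the real row objects):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Strain:
    """Hypothetical stand-in for a row of the session's strain table."""
    organism_id: int
    track_strain_id: Optional[int] = None
    strain_name: Optional[str] = None


def make_strain_key(strain):
    # Like Perl's defined-or (//): fall back to the name only when the Track
    # strain ID is missing, not merely falsy (so an ID of 0 would still count).
    if strain.track_strain_id is not None:
        identifier = strain.track_strain_id
    else:
        identifier = strain.strain_name
    return f"{strain.organism_id}-{identifier}"


print(make_strain_key(Strain(2, track_strain_id=17)))      # 2-17
print(make_strain_key(Strain(2, strain_name="custom-1")))  # 2-custom-1
```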

jseager7 commented 5 months ago

I've tried that patch and it fixes the uninitialized string warnings, thanks.

I've been looking into the other warning, and I don't think it's a problem with the script.

Use of uninitialized value in hash element at /canto/etc/reapply_species_strain_map.pl line 161.

It's caused by this part of the code:

    my $updated_strain_id =
      $strain_id_update_map{$genotype->strain_id()};

I checked the offending session and it looks like this is a problem with PHI-Canto: we somehow created a wild-type genotype with no strain. No idea how this happened since the UI is meant to prevent it:

The offending genotype isn't referenced by any metagenotypes in the metagenotype table either. Very odd.

I might have to patch the session database to fix this.

kimrutherford commented 5 months ago

I've tried that patch and it fixes the uninitialized string warnings, thanks.

Great.

we somehow created a wild-type genotype with no strain.

Yep, that's not supposed to happen!

Use of uninitialized value in hash element at /canto/etc/reapply_species_strain_map.pl line 161.

I think the code will do the correct thing in this case, despite the warning.

jseager7 commented 5 months ago

I think the code will do the correct thing in this case, despite the warning.

Great. I'll run the code on our production server tomorrow and close this issue if there are no problems.

jseager7 commented 5 months ago

I've run the update on the production server and all seems fine, just forgot to close the issue like I said I would.

Thanks again for your help Kim, considering this turned out to be somewhat complicated.

pombase / canto

Strain-to-species mapping doesn't affect already added organisms #2831