lmb-embrapa / machado

This repository provides users with a framework to store, search and visualize biological data.
GNU General Public License v3.0
26 stars 14 forks source link

Error when attempting to load features with the same name for different organisms #344

Closed njbooher closed 1 year ago

njbooher commented 2 years ago

Setup

>>> import sys; print(sys.version)
3.8.8 (default, Aug 11 2021, 06:52:42) 
[GCC 8.5.0 20210514 (Red Hat 8.5.0-3)]
>>> import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-5.14.9-200.fc34.x86_64-x86_64-with-glibc2.2.5

Actual behaviour

I am attempting to load sequences and features for multiple organisms that only have draft genomes available. Accordingly, both assemblies have sequence names like scaffold_1. Loading the sequences for the first genome works fine, but when I try to load the second I get errors about that sequence already existing in the database.

Expected behaviour

The unique key for the Feature class is unique_together = (("organism", "uniquename", "type"),), so I would expect to be able to have features with the same name for different organisms.

Renaming the sequences to add a unique prefix before loading isn't really desirable as we already have downstream analysis files that use those sequence names.

Most of this can be traced to retrieve_feature_id in loaders/common.py only taking 2 parameters: accession: str, soterm: str.

I would be willing to make a PR to change it to add an organism parameter, and update the places it's called to pass that. However, this method is also used by views.FeatureIDViewSet, and I think fixing that would involve changing that URL path to include an organism name or ID. I believe views.JBrowseNamesViewSet would also need changed since it doesn't receive an organism as a path or query parameter.

I wanted to check if such a pull request is likely to be accepted before working on it.

azneto commented 2 years ago

The main reason that led us to make uniquename distinct was the JBrowse API.

Take a look:

"Also, track types that display features as boxes laid out on the genome (such as HTMLFeatures, CanvasFeatures, Alignments, and Alignments2 ) require all top-level features to have a globally-unique ID, which should be set as uniqueID in the JSON emitted by the service. " (https://jbrowse.org/docs/data_formats.html)

The features API doesn't have an organism parameter (GET (base)/features/(refseq_name)?start=234&end=5678)

The workaround was adding unique prefixes to uniquename, and storing the original value in the name field.

njbooher commented 2 years ago

The way we workaround this with our Tripal sites that I'm trying to move is by prefixing the JBrowse api routes with the organism ID

E.g.

https://www.serioladb.org/jbrowse/?data=../api/jbrowse/13&tracks=refseq%2Cgenes%2C212578 https://abalone.dbgenome.org/jbrowse/?data=../api/jbrowse/3&tracks=refseq%2Cgenes%2C413423

azneto commented 2 years ago

I guess that will solve the problem. Let me go through every loading tool to verify the implications of your proposal and I'll let you know.

azneto commented 2 years ago

After reviewing the implications of your proposal, we're happy to evaluate your PR to adequate the Machado software to the original database schema in regard to using the Feature unique key ("organism", "uniquename", "type").

The new retrieve_feature_id function should receive a string parameter containing the species name and use retrieve_organism to retrieve the organism from the database.

Here are the files that will require modifications:

api/views.py
loaders/similarity.py
loaders/analysis.py
loaders/feature.py
loaders/common.py
loaders/sequence.py

The api.views.FeatureIDViewSet is not used elsewhere in Machado. It's an API endpoint that we thought would be important for the API users to have and should be refactored to receive the organism ID. It's desirable to create a new endpoint to enable API users to retrieve the organism ID by species name (OrganismIdViewSet), similarly to what FeatureIDViewSet does.

The api.views.JBrowseNamesViewSet already requires the organism parameter and won't need changes (eg. https://www.machado.cnptia.embrapa.br/demo_machado/api/jbrowse/names?organism=Zea%20mays&equals=AC148152.3_FGT001.v6a)

The JBrowse URL is constructed in decorators.py and already contains the species name in the URL, therefore won't need changes (eg. https://www.machado.cnptia.embrapa.br/jbrowse_demo/?data=data%2FZea%20mays&loc=Zmays_2%3A232027460..232031697&tracks=ref_seq%2Cgene%2Ctranscripts%2CCDS).

njbooher commented 2 years ago

Ok, thanks! I'm out of the office next week but will work on this when I get back.