Closed njbooher closed 3 years ago
Merging #325 (1b240ee) into master (72ba56c) will increase coverage by 0.08%. The diff coverage is 0.00%.
@@ Coverage Diff @@
## master #325 +/- ##
==========================================
+ Coverage 68.71% 68.80% +0.08%
==========================================
Files 30 30
Lines 4060 4055 -5
Branches 235 235
==========================================
Hits 2790 2790
+ Misses 1211 1206 -5
Partials 59 59
Impacted Files | Coverage Δ |
---|---|
machado/api/serializers.py | 0.00% <0.00%> (ø) |
machado/api/views.py | 0.00% <0.00%> (ø) |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
I'm glad you already have it running on a live site. Congrats!
Before reviewing the code, I'd like to understand exactly what the problem is. I've noticed a few things: 1) there are no relationships among the features of genome v1; 2) the embedded JBrowse is showing up on the protein page, even though protein sequences don't align to genome sequences (https://db.scnbase.org/feature/?feature_id=654331); 3) some proteins have more than one location (https://db.scnbase.org/feature/?feature_id=1071513).
Usually, there are no proteins in the GFF file and the load_gff tool creates protein entries for each mRNA.
The features are linked to each other using the key parent_id, which establishes the relationships among them. Each feature is described on a single line of the GFF file, so each one is expected to have a single location.
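One way to check that expectation against a GFF3 file is to count the distinct locations recorded for each feature ID. The following is a minimal sketch, not part of machado: the function names are hypothetical, and it assumes standard nine-column, tab-separated GFF3 lines where the identifier is in the `ID` attribute.

```python
from collections import defaultdict


def locations_per_id(gff_lines):
    """Map each GFF3 feature ID to the set of locations it appears with."""
    locations = defaultdict(set)
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines, directives, and comments.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip lines that are not nine-column feature records
        seqid, _source, _type, start, end, _score, strand, _phase, attrs = cols
        # Column 9 holds semicolon-separated key=value pairs.
        attributes = dict(
            pair.split("=", 1) for pair in attrs.split(";") if "=" in pair
        )
        feature_id = attributes.get("ID")
        if feature_id is not None:
            locations[feature_id].add((seqid, int(start), int(end), strand))
    return locations


def ids_with_multiple_locations(gff_lines):
    """Return only the IDs that occur with more than one distinct location."""
    return {
        feature_id: sorted(locs)
        for feature_id, locs in locations_per_id(gff_lines).items()
        if len(locs) > 1
    }
```

Passing an open file handle (for example `ids_with_multiple_locations(open(path))`) would list any IDs that break the one-location assumption; an empty result means every ID in the file has a single location.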
There might be something going on with your GFF files. Would you please send me a link to them or maybe a snippet of the files so I can take a look?
Hello,
I suppose I should mention that the initial loading was done with Tripal years ago, and I'm working to move away from Tripal. I stood up a new postgres database, initialized machado, then imported the features from a dump of the Tripal chado database.
I'll try and find the gff3 files used for the initial load tomorrow, but if it helps, the previous JBrowse I'm still migrating from is here:
The annotations for this one are served from Chado using these queries: https://github.com/isubit/tripal_jbrowse_api/blob/master/includes/tripal_jbrowse_api.queries.inc
The gff3 sections for Hetgly.G000000004.gff3.txt (v1 genome) and HETGLY_00001.gff3.txt (v2 genome).
Links to related Tripal pages: https://scnbase.org/feature/Heterodera/glycines/mRNA/HETGLY_00001-RA?pane=relationships https://scnbase.org/feature/Heterodera/glycines-v2/mRNA/Hetgly.T000000004.1?pane=relationships
Looking at the second link I think that annotation might have been loaded multiple times rather than updated. I'll inquire about reloading it.
The GFF files are correct. There's only a single location for each feature ID. As you mentioned, there's a chance the dataset was loaded multiple times.
The problem is that the endpoint for the JBrowse data in the Machado API is not built to handle multiple locations for a given feature. Where is the data for the JBrowse instance you sent stored?
Does JBrowse retrieve the data from the database or from indexed files?
From the database, using this code: https://github.com/isubit/tripal_jbrowse_api/blob/master/tripal_jbrowse_api.module#L350 And this query: https://github.com/isubit/tripal_jbrowse_api/blob/master/includes/tripal_jbrowse_api.queries.inc#L78
I believe this PR would make the Machado JBrowse API able to handle multiple locations for a feature.
The GFF snippets you sent and the JBrowse link don't have multiple locations for the CDS features:
The current code should work for the GFF file you sent. We need to find out what's triggering the error.
But the JBrowse endpoint from the Machado API is throwing the error you mentioned:
Can you run the following query to identify which CDS entries in your Featureloc table have multiple locations?
First, open the Django shell with `python manage.py shell`, then run:
from machado.models import Featureloc
Featureloc.objects.filter(srcfeature__uniquename='scaffold_4', fmin__gte=1, fmax__lte=9320, feature__type__name='CDS').values_list('feature_id', 'feature__uniquename', 'fmin', 'fmax')
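The rows returned by that query can be grouped by feature to pick out the ones with more than one location. This is only a sketch in plain Python: the helper name is hypothetical, and it assumes the rows are `(feature_id, uniquename, fmin, fmax)` tuples in the same order as the `values_list` call above.

```python
from collections import defaultdict


def features_with_multiple_locations(rows):
    """Group (feature_id, uniquename, fmin, fmax) rows by feature.

    Returns a dict mapping (feature_id, uniquename) to its list of
    (fmin, fmax) pairs, keeping only features with more than one row.
    """
    by_feature = defaultdict(list)
    for feature_id, uniquename, fmin, fmax in rows:
        by_feature[(feature_id, uniquename)].append((fmin, fmax))
    return {
        key: locations
        for key, locations in by_feature.items()
        if len(locations) > 1
    }
```

In the Django shell, the query result can be passed in directly, since a `values_list` queryset iterates as tuples.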
The results of that query are attached.
Followup queries:
SELECT * FROM featureloc WHERE feature_id IN (897199,897223,897243);
query_results_2.txt
Featureloc.objects.filter(srcfeature__uniquename='scaffold_4', fmin__gte=1, fmax__lte=9320, feature__type__name='CDS').values_list('featureloc_id','feature_id','feature__uniquename', 'srcfeature_id', 'srcfeature__uniquename', 'fmin','is_fmin_partial','fmax','is_fmax_partial','strand','phase','residue_info','locgroup','rank')
query_results_3.txt
(Thank you for taking the time to help figure out the cause)
I reloaded the database from scratch with the machado file loaders. Issue seems to be resolved. Thanks!
We're now using this on our live site: https://db.scnbase.org/
The JBrowse API works for one of the genomes: https://db.scnbase.org/feature/?feature_id=530426
But breaks for the other: https://db.scnbase.org/feature/?feature_id=897201
This PR changes the basis for the JBrowse features to be Featureloc rather than Feature, which seems to resolve the issue.
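To illustrate the idea rather than the actual diff, here is a minimal sketch in plain Python of building JBrowse-style feature records from location rows instead of from features. The `Location` class, the `jbrowse_features` function, and the choice of `featureloc_id` as the `uniqueID` are assumptions made for this example, not machado's code; the `start`/`end`/`strand`/`uniqueID` keys follow the feature format of the JBrowse 1 REST API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical stand-in for one row of the Chado featureloc table."""
    featureloc_id: int
    feature_id: int
    uniquename: str
    type_name: str
    fmin: int
    fmax: int
    strand: int


def jbrowse_features(locations, start, end):
    """Emit one record per location overlapping the half-open range [start, end).

    Iterating over locations means a feature with several featureloc rows
    yields several records instead of breaking a one-location assumption.
    """
    features = []
    for loc in locations:
        # Chado stores interbase coordinates, so treat [fmin, fmax) as half-open.
        if loc.fmax <= start or loc.fmin >= end:
            continue
        features.append(
            {
                # Assumption: key on the location row so duplicates stay distinct.
                "uniqueID": str(loc.featureloc_id),
                "name": loc.uniquename,
                "type": loc.type_name,
                "start": loc.fmin,
                "end": loc.fmax,
                "strand": loc.strand,
            }
        )
    return {"features": features}
```

With this shape, two featureloc rows for the same `feature_id` simply produce two entries in the `features` list.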
I hereby agree to licence this and any previous contributions under the terms of the GNU General Public License version 3 as published by the Free Software Foundation
I have read the CONTRIBUTING.rst file and understand that TravisCI will be used to confirm the tests and flake8 style checks pass with these changes.