lmb-embrapa / machado

This repository provides users with a framework to store, search and visualize biological data.
GNU General Public License v3.0

Jbrowse use featureloc #325

Closed njbooher closed 3 years ago

njbooher commented 3 years ago

We're now using this on our live site: https://db.scnbase.org/

JBrowse api works for one of the genomes: https://db.scnbase.org/feature/?feature_id=530426

But breaks for the other: https://db.scnbase.org/feature/?feature_id=897201

  File "/opt/app-root/lib64/python3.8/site-packages/machado/api/serializers.py", line 114, in get_start
    return self._get_location(obj).fmin
  File "/opt/app-root/lib64/python3.8/site-packages/machado/api/serializers.py", line 89, in _get_location
    feature_loc = Featureloc.objects.get(
  File "/opt/app-root/lib64/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/app-root/lib64/python3.8/site-packages/django/db/models/query.py", line 433, in get
    raise self.model.MultipleObjectsReturned(
machado.models.Featureloc.MultipleObjectsReturned: get() returned more than one Featureloc -- it returned 15!

This PR changes the basis for the JBrowse features to be Featureloc rather than Feature, which seems to resolve the issue.
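To illustrate the difference, here is a minimal pure-Python sketch (not the machado serializer code). `Loc`, `get_single` and `jbrowse_features` are hypothetical names invented for this example, and the output keys (`name`, `start`, `end`, `strand`) are assumed JBrowse-style field names. A strict "exactly one location" lookup fails as soon as a feature has several location rows, while building one output record per location row does not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Loc:
    """Toy stand-in for a featureloc row (hypothetical, not the machado model)."""
    feature_id: int
    uniquename: str
    fmin: int
    fmax: int
    strand: int


def get_single(locs, feature_id):
    """Mimic a strict lookup: exactly one matching row, otherwise an error."""
    matches = [loc for loc in locs if loc.feature_id == feature_id]
    if len(matches) != 1:
        raise LookupError(f"expected 1 location, found {len(matches)}")
    return matches[0]


def jbrowse_features(locs):
    """Emit one record per location row, so repeated locations cannot raise."""
    return [
        {"name": loc.uniquename, "start": loc.fmin, "end": loc.fmax, "strand": loc.strand}
        for loc in locs
    ]


# Made-up rows: feature 1 has two location rows, feature 2 has one.
locs = [
    Loc(1, "cds1", 10, 20, 1),
    Loc(1, "cds1", 10, 20, 1),
    Loc(2, "cds2", 30, 40, -1),
]
```

With these rows, `get_single(locs, 1)` raises `LookupError`, whereas `jbrowse_features(locs)` returns three records. Iterating over locations avoids the crash but does not remove duplicate rows from the result.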

I hereby agree to licence this and any previous contributions under the terms of the GNU General Public License version 3 as published by the Free Software Foundation

I have read the CONTRIBUTING.rst file and understand that TravisCI will be used to confirm the tests and flake8 style checks pass with these changes.

codecov-commenter commented 3 years ago

Codecov Report

Merging #325 (1b240ee) into master (72ba56c) will increase coverage by 0.08%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master     #325      +/-   ##
==========================================
+ Coverage   68.71%   68.80%   +0.08%     
==========================================
  Files          30       30              
  Lines        4060     4055       -5     
  Branches      235      235              
==========================================
  Hits         2790     2790              
+ Misses       1211     1206       -5     
  Partials       59       59              
Impacted Files               Coverage Δ
machado/api/serializers.py   0.00% <0.00%> (ø)
machado/api/views.py         0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 72ba56c...1b240ee.

azneto commented 3 years ago

I'm glad you already have it running on a live site. Congrats!

Before reviewing the code, I'd like to understand exactly what the problem is. I've noticed a few things:
1) there's no relationship among the features of the genome v1
2) the embedded JBrowse is showing up on the protein page, even though protein sequences definitely don't align to genome sequences (https://db.scnbase.org/feature/?feature_id=654331)
3) some proteins have more than one location (https://db.scnbase.org/feature/?feature_id=1071513)

Usually, there are no proteins in the GFF file and the load_gff tool creates protein entries for each mRNA.

The features are linked to each other using the key parent_id, to establish the relationship among them. Each feature is described in the GFF file in a single line, therefore each one of them is expected to have a single location.
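One way to check that expectation on a file is to count how many lines carry each ID. The sketch below is only an illustration: the GFF3 snippet is made up (it is not taken from the files discussed here), and `parse_attributes`, `lines_per_id` and `repeated_ids` are hypothetical helper names. In GFF3 itself the parent link is written as the `Parent` attribute in column 9.

```python
from collections import Counter

# Made-up GFF3 lines for illustration only; cds1 deliberately appears twice.
GFF3_SNIPPET = (
    "##gff-version 3\n"
    "scaffold_4\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n"
    "scaffold_4\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n"
    "scaffold_4\tmaker\tCDS\t100\t400\t.\t+\t0\tID=cds1;Parent=mrna1\n"
    "scaffold_4\tmaker\tCDS\t600\t900\t.\t+\t0\tID=cds1;Parent=mrna1\n"
)


def parse_attributes(column9):
    """Split the ninth GFF3 column ('key=value;key=value') into a dict."""
    return dict(pair.split("=", 1) for pair in column9.split(";") if "=" in pair)


def lines_per_id(gff_text):
    """Count how many feature lines carry each ID attribute."""
    counts = Counter()
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            counts[attrs["ID"]] += 1
    return counts


def repeated_ids(gff_text):
    """Return only the IDs that appear on more than one line."""
    return {fid: n for fid, n in lines_per_id(gff_text).items() if n > 1}
```

Calling `repeated_ids(GFF3_SNIPPET)` on the made-up snippet reports `cds1`, the only ID that occurs on two lines. An empty result on a real file would support the one-line-per-feature expectation.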

There might be something going on with your GFF files. Would you please send me a link to them or maybe a snippet of the files so I can take a look?

njbooher commented 3 years ago

Hello,

I suppose I should mention that the initial loading was done with Tripal years ago, and I'm working to move away from Tripal. I stood up a new postgres database, initialized machado, then imported the features from a dump of the Tripal chado database.

I'll try and find the gff3 files used for the initial load tomorrow, but if it helps, the previous JBrowse I'm still migrating from is here:

https://scnbase.org/jbrowse/?data=https%3A%2F%2Fscnbase.org%2Fapi%2Fjbrowse%2F16&loc=scaffold_1%3A802488..1203733&tracks=refseq%2Cgenes%2Ctranscripts&highlight=

The annotations for this one are served from Chado using these queries: https://github.com/isubit/tripal_jbrowse_api/blob/master/includes/tripal_jbrowse_api.queries.inc

njbooher commented 3 years ago

Here are the gff3 sections for Hetgly.G000000004.gff3.txt (v1 genome) and HETGLY_00001.gff3.txt (v2 genome).

Links to related Tripal pages:
https://scnbase.org/feature/Heterodera/glycines/mRNA/HETGLY_00001-RA?pane=relationships
https://scnbase.org/feature/Heterodera/glycines-v2/mRNA/Hetgly.T000000004.1?pane=relationships

Looking at the second link, I think that annotation might have been loaded multiple times rather than updated. I'll inquire about reloading it.

azneto commented 3 years ago

The GFF files are correct. There's only a single location for each feature ID. As you mentioned, there's a chance the dataset was loaded multiple times.

The problem is that the endpoint for the JBrowse data in the Machado API is not built to handle multiple locations for a given feature. Where is the data for the JBrowse you sent stored?

https://scnbase.org/jbrowse/?data=https%3A%2F%2Fscnbase.org%2Fapi%2Fjbrowse%2F16&loc=scaffold_4%3A9181..12908&tracks=refseq%2Cgenes%2Ctranscripts&highlight=

Does JBrowse retrieve the data from the database or from indexed files?

njbooher commented 3 years ago

From the database, using this code: https://github.com/isubit/tripal_jbrowse_api/blob/master/tripal_jbrowse_api.module#L350
And this query: https://github.com/isubit/tripal_jbrowse_api/blob/master/includes/tripal_jbrowse_api.queries.inc#L78

I believe this PR would make the Machado JBrowse API able to handle multiple locations for a feature.

azneto commented 3 years ago

The GFF snippets you sent and the JBrowse link don't have multiple locations for the CDS features:

https://scnbase.org/jbrowse/?data=https%3A%2F%2Fscnbase.org%2Fapi%2Fjbrowse%2F16&loc=scaffold_4%3A1..9320&tracks=refseq%2Cgenes%2Ctranscripts&highlight=

The current code should work for the GFF file you sent. We need to find out what's triggering the error.

But the JBrowse served by the machado API is throwing the error you mentioned:

https://db.scnbase.org/static/jbrowse/?data=data%2FHeterodera%20glycines%20v2&loc=scaffold_4%3A1..9317&tracklist=1&nav=1&overview=1&tracks=ref_seq%2Cgene%2Ctranscripts%2CCDS%2CSNV&highlight=

Can you run the following query to identify which CDS entries in your Featureloc table have multiple locations? First, open a shell with python manage.py shell

then ...

from machado.models import Featureloc
Featureloc.objects.filter(srcfeature__uniquename='scaffold_4', fmin__gte=1, fmax__lte=9320, feature__type__name='CDS').values_list('feature_id', 'feature__uniquename', 'fmin', 'fmax')
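
The query above lists every CDS location in the window. A complementary check is to group the rows by feature_id and keep only the features that have more than one. The sketch below shows that idea against a toy in-memory SQLite table rather than the real database. The table holds only a few of the featureloc columns, the rows are made up, and `build_toy_db` and `features_with_multiple_locations` are hypothetical helper names.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory table with a few featureloc-like columns (toy data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE featureloc ("
        "featureloc_id INTEGER PRIMARY KEY, "
        "feature_id INTEGER NOT NULL, "
        "srcfeature_id INTEGER, "
        "fmin INTEGER, "
        "fmax INTEGER)"
    )
    # Made-up rows: feature 1 was "loaded" three times, feature 2 once.
    rows = [(1, 100, 10, 20), (1, 100, 10, 20), (1, 100, 10, 20), (2, 100, 30, 40)]
    conn.executemany(
        "INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def features_with_multiple_locations(conn):
    """Return (feature_id, number_of_locations) for features with more than one row."""
    query = """
        SELECT feature_id, COUNT(*) AS n_locations
        FROM featureloc
        GROUP BY feature_id
        HAVING COUNT(*) > 1
        ORDER BY feature_id
    """
    return conn.execute(query).fetchall()


conn = build_toy_db()
duplicates = features_with_multiple_locations(conn)
```

Here `duplicates` contains a single entry for feature 1, which has three rows. The SELECT statement uses only standard GROUP BY and HAVING, so a similar query could presumably be run against the real featureloc table, though that has not been checked here.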

njbooher commented 3 years ago

The results of that query are attached.

query_results.txt

njbooher commented 3 years ago

Follow-up queries:

SELECT * FROM featureloc WHERE feature_id IN (897199,897223,897243);

query_results_2.txt

Featureloc.objects.filter(srcfeature__uniquename='scaffold_4', fmin__gte=1, fmax__lte=9320, feature__type__name='CDS').values_list('featureloc_id','feature_id','feature__uniquename', 'srcfeature_id', 'srcfeature__uniquename', 'fmin','is_fmin_partial','fmax','is_fmax_partial','strand','phase','residue_info','locgroup','rank')

query_results_3.txt

(Thank you for taking the time to help figure out the cause)

njbooher commented 3 years ago

I reloaded the database from scratch with the machado file loaders. Issue seems to be resolved. Thanks!