lmb-embrapa / machado

This repository provides users with a framework to store, search and visualize biological data.
GNU General Public License v3.0

load_fasta failed #362

Closed: mthang closed this issue 10 months ago

mthang commented 12 months ago

Setup

Python version and operating system:

import sys; print(sys.version)
3.11.3 (main, Apr 19 2023, 23:54:32) [GCC 11.2.0]

import platform; print(platform.python_implementation()); print(platform.platform())
CPython
Linux-5.4.0-162-generic-x86_64-with-glibc2.31

Expected behaviour

I was trying to load a FASTA file into the database using the following command:

python manage.py load_fasta --file genome.fa --soterm chromosome --organism 'Escherichia coli'

and received the following error message:

raise self.model.DoesNotExist(
machado.models.Cvterm.DoesNotExist: Cvterm matching query does not exist.

So I don't know what went wrong.

Thanks,

azneto commented 12 months ago

It seems like the Cvterm (Controlled vocabulary term) chromosome is not registered in the database. This term is contained in the sequence ontology. Have you loaded the ontologies? https://machado.readthedocs.io/en/latest/load_ontologies.html

mthang commented 12 months ago

You are right. I didn't register the chromosome term in the database because I just want to load a reference genome and a GFF3 file to feed JBrowse. I wasn't aware that loading the cvterms was required. FYI, I followed the instructions at https://machado.readthedocs.io/en/latest/installation.html to install machado and link it to JBrowse. However, I am not able to get machado and JBrowse working together.

mthang commented 12 months ago

I am still getting the error message "machado.models.Cvterm.DoesNotExist: Cvterm matching query does not exist" even though I have loaded so.obo, ro.obo and go.obo following the instructions at https://machado.readthedocs.io/en/latest/load_ontologies.html

I followed the instructions at https://machado.readthedocs.io/en/latest/installation.html to install and configure machado, but I can't tell what I missed.

BTW, I tried to load the E. coli reference genome using the load_fasta tool.

Thanks,

azneto commented 12 months ago

It will be much easier and faster to just index the FASTA and GFF files so that JBrowse itself can access them.

There are plenty of tutorials about it, for example:
https://jbrowse.org/docs/faq_setup.html
https://gist.github.com/darencard/4db3be0c396dd24a5dbdec649ca4adf9

azneto commented 12 months ago

That's strange. In order to check if the ontologies are loaded correctly, can you post the results of the following commands:

Inside the Django shell (python manage.py shell):

>>> from machado.models import Cvterm
>>> Cvterm.objects.filter(name='chromosome').values_list('cv__name', 'cvterm_id', 'name')

mthang commented 11 months ago

@azneto I get the result "<QuerySet [('sequence', 359, 'chromosome')]>" from running the query in the Django shell. FYI, I am rebuilding the DB and have so.obo loaded. Now I am loading names.dmp.

azneto commented 11 months ago

Based on the query result, the sequence ontology seems to be loaded correctly. Step 2.2, loading the NCBI taxonomy, is optional. If you skip it, you need to insert the organism name (step 2.3), in your case 'Escherichia coli'.

Can you post the complete error message of the following command?

python manage.py load_fasta --file genome.fa --soterm chromosome --organism 'Escherichia coli' --verbosity 3

mthang commented 11 months ago

@azneto Thank you for looking into this issue! FYI, I loaded the NCBI taxonomy into the DB (step 2.2) because skipping it and using "insert_organism" (step 2.3) didn't work.

The results below are from after I loaded the NCBI taxonomy into the DB.

python manage.py load_fasta --file genome.fa --soterm chromosome --organism 'Escherichia coli' --verbosity 3 (it works when I change --verbosity to --version)

python manage.py load_gff --file ~/genes_NCBI.sorted.gff3.gz --organism "Escherichia coli" --version 3 (this works as well when the --version flag is provided)

My observation is that neither command line above works without the --version flag.

Then the question is how to find the data under "Data summary" on the website. I can't find anything under the data summary after uploading the FASTA and GFF files.

azneto commented 11 months ago

That's good news! After loading the files, the website will only show a few numbers. The full interface relies on the ElasticSearch engine, therefore you'll need to go through the following steps: https://machado.readthedocs.io/en/latest/index_search.html

mthang commented 11 months ago

I have installed Elasticsearch and followed the steps from the URL. I don't know what I have done wrong, because there are 0 features to index after running python manage.py rebuild_index. It's strange. Any thoughts?

I executed the curl command and got the following message:

curl -XPUT "http://localhost:9200/haystack/_settings" -d '{ "index" : { "max_result_window" : 500000 } }' -H "Content-Type: application/json"

{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [haystack]","resource.type":"index_or_alias","resource.id":"haystack","index_uuid":"na","index":"haystack"}],"type":"index_not_found_exception","reason":"no such index [haystack]","resource.type":"index_or_alias","resource.id":"haystack","index_uuid":"na","index":"haystack"},"status":404}

azneto commented 11 months ago

If you use the parameter --version, it will show the version of the program and exit, therefore nothing will be loaded. The parameter --verbosity will increase/decrease the amount of details of the output. Can you post the full error message using --verbosity 3?
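
The difference between the two flags can be reproduced with a small standalone parser. This is a minimal sketch using Python's argparse, not machado's or Django's actual parser; the program name `load_fasta_demo`, the version string, and the helper `would_load` are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Toy parser with the two options discussed above (hypothetical, not machado's)."""
    parser = argparse.ArgumentParser(prog="load_fasta_demo")
    parser.add_argument("--file", required=True)
    # The "version" action prints the version string and exits as soon as
    # the flag is encountered; anything after it is never processed.
    parser.add_argument("--version", action="version", version="demo 0.0")
    # A plain integer option: it only controls how much output is produced.
    parser.add_argument("--verbosity", type=int, choices=[0, 1, 2, 3], default=1)
    return parser


def would_load(argv: list[str]) -> bool:
    """Return True if parsing completes, i.e. the command body would run."""
    try:
        build_parser().parse_args(argv)
    except SystemExit:
        # Raised by the "version" action (and by argument errors).
        return False
    return True


if __name__ == "__main__":
    print(would_load(["--file", "genome.fa", "--verbosity", "3"]))
    print(would_load(["--file", "genome.fa", "--version", "3"]))
```

With `--version`, the process ends before any loading code is reached, which is why the command appears to succeed while loading nothing.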

mthang commented 11 months ago

It is the same error message I posted here in the beginning.

python manage.py load_fasta --file genome.fa --soterm chromosome --organism 'Escherichia coli' --verbosity 3
Preprocessing
Traceback (most recent call last):
  File "/home/mthang/YOURPROJECT/WEBPROJECT/manage.py", line 22, in <module>
    main()
  File "/home/mthang/YOURPROJECT/WEBPROJECT/manage.py", line 18, in main
    execute_from_command_line(sys.argv)
  File "/home/mthang/YOURPROJECT/lib/python3.11/site-packages/django/core/management/__init__.py", line 446, in execute_from_command_line
    utility.execute()
  File "/home/mthang/YOURPROJECT/lib/python3.11/site-packages/django/core/management/__init__.py", line 440, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/mthang/YOURPROJECT/lib/python3.11/site-packages/django/core/management/base.py", line 402, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/mthang/YOURPROJECT/lib/python3.11/site-packages/django/core/management/base.py", line 448, in execute
    output = self.handle(*args, **options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mthang/YOURPROJECT/lib/python3.11/site-packages/machado/management/commands/load_fasta.py", line 90, in handle
    sequence_file = SequenceLoader(
                    ^^^^^^^^^^^^^^^
  File "/home/mthang/YOURPROJECT/lib/python3.11/site-packages/machado/loaders/sequence.py", line 42, in __init__
    self.cvterm_contained_in = Cvterm.objects.get(
                               ^^^^^^^^^^^^^^^^^^^
  File "/home/mthang/YOURPROJECT/lib/python3.11/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mthang/YOURPROJECT/lib/python3.11/site-packages/django/db/models/query.py", line 650, in get
    raise self.model.DoesNotExist(
machado.models.Cvterm.DoesNotExist: Cvterm matching query does not exist.

mthang commented 11 months ago

I didn't make many changes when setting up machado following the installation instructions on the Read the Docs URL. I am not sure whether something is missing from the guide or something is wrong in my configuration/setup.

Q: is the "Loading Feature Additional Info" mandatory?

mthang commented 11 months ago

I just had a look at the organism.py. On line 72, the organism object does not have "type_id" but I can see the "type_id" column in the organism table in the db. The chado table schema here also has the type_id which is the same as the machado chado schema. I wonder which part of the organism.py code inserted the type_id into the organism object.

azneto commented 11 months ago

The error message refers to a missing term: 'contained in' (looked up as cvterm_contained_in in sequence.py). This term is loaded using the command load_relations_ontology:

https://machado.readthedocs.io/en/latest/load_ontologies.html#relations-ontology

Have you loaded the relations ontology? If you query the database, is this term registered in the cvterm table?

You can verify it using the following commands:

$ python manage.py shell
>>> from machado.models import Cvterm
>>> Cvterm.objects.filter(name='contained in').values_list('cv__name', 'cvterm_id', 'name')
<QuerySet [('relationship', 37, 'contained in')]>

azneto commented 11 months ago

Can you try running the installation steps without making any changes at all? If you follow each step rigorously, it should work fine.

"Loading Feature Additional Info" is not mandatory, like most of the commands. However, each dataset depends on the others. To load FASTA and GFF files, the three ontologies must be loaded, along with the organism these files relate to. Everything else is optional.

azneto commented 11 months ago

The type_id is used to associate a feature or organism to an ontology. This field is not required.

mthang commented 11 months ago

Thank you for the pointers! I did load the ro.obo file.

I just tested the query you provided above. The output is:

Cvterm.objects.filter(name='contained in').values_list('cv__name', 'cvterm_id', 'name')

I thought my DB might be empty, so I ran the following query and got a result:

Cvterm.objects.filter(name='has part').values_list('cv__name', 'cvterm_id', 'name')
<QuerySet [('relationship', 8, 'has part')]>

I then checked the ro.obo content at https://github.com/oborel/obo-relations/blob/master/ro.obo. The term "contained in" has become "obsolete contained in" (line 1007), so I checked my DB with the following SQL statement.

Here's the output:

select * from cvterm where name='obsolete contained in' limit 10;

 cvterm_id |         name          | definition | is_obsolete | is_relationshiptype | cv_id | dbxref_id
-----------+-----------------------+------------+-------------+---------------------+-------+-----------
        37 | obsolete contained in |            |           0 |                   1 |     1 |        37
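
The manual inspection of ro.obo described above can also be sketched in code. This is a minimal, hypothetical example: the `SAMPLE_OBO` text and its `EX:` identifiers are invented for illustration and are not copied from the real ro.obo, and the parser only handles simple `tag: value` lines.

```python
# Illustrative OBO text; the EX: identifiers are made up, not real RO ids.
SAMPLE_OBO = """\
[Typedef]
id: EX:0000001
name: has part

[Typedef]
id: EX:0000002
name: obsolete contained in
is_obsolete: true
"""


def parse_stanzas(text: str) -> list[dict[str, str]]:
    """Split OBO text into stanzas, keeping the first value seen for each tag."""
    stanzas: list[dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A stanza header such as [Term] or [Typedef] starts a new record.
            current = {"stanza": line[1:-1]}
            stanzas.append(current)
        elif line and current is not None and ":" in line:
            tag, value = line.split(":", 1)
            current.setdefault(tag.strip(), value.strip())
    return stanzas


def term_status(stanzas: list[dict[str, str]], name: str) -> str:
    """Return 'active', 'obsolete', or 'missing' for an exact term name."""
    for stanza in stanzas:
        if stanza.get("name") == name:
            return "obsolete" if stanza.get("is_obsolete") == "true" else "active"
    return "missing"


if __name__ == "__main__":
    stanzas = parse_stanzas(SAMPLE_OBO)
    for name in ("has part", "contained in", "obsolete contained in"):
        print(name, "->", term_status(stanzas, name))
```

Running the same check against a downloaded ontology file would show whether a term name that a loader expects is still present under that exact name.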

Any thoughts? Much appreciated!

mthang commented 11 months ago

FYI, both load_fasta and load_gff work after I changed "obsolete contained in" to "contained in" in the cvterm table. Because the name stored in the cvterm table had changed, I kept receiving the error message "Cvterm matching query does not exist". I then reviewed the loaders/sequence.py code, and it seems that "contained in" is hard-coded.
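
Why a hard-coded name fails here can be shown with a small in-memory stand-in for the cvterm table. This is only a sketch: `get_cvterm` and the local `DoesNotExist` class are hypothetical substitutes for the Django ORM lookup, and the rows are adapted from the query outputs quoted in this thread (assuming cv_id 1 is the 'relationship' vocabulary).

```python
class DoesNotExist(Exception):
    """Stand-in for machado.models.Cvterm.DoesNotExist."""


# Rows as (cvterm_id, name, cv_name), adapted from the outputs quoted above.
CVTERM_ROWS = [
    (8, "has part", "relationship"),
    (37, "obsolete contained in", "relationship"),
]


def get_cvterm(rows, name, cv_name):
    """Exact-match lookup that raises when nothing matches, like QuerySet.get()."""
    matches = [row for row in rows if row[1] == name and row[2] == cv_name]
    if not matches:
        raise DoesNotExist("Cvterm matching query does not exist.")
    return matches[0]


if __name__ == "__main__":
    try:
        # The name the loader asks for no longer exists under that spelling.
        get_cvterm(CVTERM_ROWS, "contained in", "relationship")
    except DoesNotExist as exc:
        print("lookup failed:", exc)
    # The renamed row is still there, under its new name.
    print(get_cvterm(CVTERM_ROWS, "obsolete contained in", "relationship"))
```

An exact-name lookup has no way to follow a rename in the source ontology, so either the stored name or the name in the code has to change.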

I simply changed the value in the cvterm table and reran

python manage.py load_fasta --file genome.fa --soterm chromosome --organism 'Escherichia coli' --verbosity 3

AND

python manage.py load_gff --file ~/genes_NCBI.sorted.gff3.gz --organism "Escherichia coli" --verbosity 3

It works.

Thanks.

azneto commented 11 months ago

I'm glad you found a solution. Sorry for all your trouble, but fortunately it led us to pinpoint a new error. This term was modified a little over a month ago, thus we'll have to properly address this issue.

https://github.com/oborel/obo-relations/issues/693#issuecomment-1522019794

The term 'contained in' is now obsolete in favor of 'located in'. The solution is to update this term in the code since it's hardcoded, as you've noticed. Can you make a pull request?

Here are the files that need updating (replace 'contained in' with 'located in'):

machado/loaders/biomaterial.py:            name="contained in", cv__name="relationship"
machado/loaders/similarity.py:                name="contained in", cv__name="relationship"
machado/loaders/project.py:            name="contained in", cv__name="relationship"
machado/loaders/feature.py:            name="contained in", cv__name="relationship"
machado/loaders/assay.py:            name="contained in", cv__name="relationship"
machado/loaders/analysis.py:            name="contained in", cv__name="relationship"
machado/loaders/sequence.py:            name="contained in", cv__name="relationship"
machado/tests/test_loaders_assay.py:            name="contained in",
machado/tests/test_loaders_orthology.py:            name="contained in",
machado/tests/test_loaders_analysis.py:            name="contained in",
machado/tests/test_loaders_featureattributes.py:            name="contained in",
machado/tests/test_loaders_feature.py:            name="contained in",
machado/tests/test_loaders_feature.py:            name="contained in",
machado/tests/test_loaders_feature.py:            name="contained in",
machado/tests/test_loaders_feature.py:            name="contained in",
machado/tests/test_loaders_feature.py:            name="contained in",
machado/tests/test_loaders_feature.py:            name="contained in",
machado/tests/test_loaders_sequence.py:            name="contained in",
machado/tests/test_loaders_similarity.py:            name="contained in",
machado/tests/test_loaders_coexpression.py:            name="contained in",
machado/tests/test_loaders_coexpression.py:            name="contained in",
machado/tests/test_loaders_biomaterial.py:            name="contained in",
machado/tests/test_loaders_project.py:            name="contained in",
machado/management/commands/remove_analysis.py:            name="contained in", cv__name="relationship"
machado/management/commands/remove_relationship.py:            cvterm = Cvterm.objects.get(name="contained in", cv__name="relationship")
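
A replacement across that many files could be scripted rather than done by hand. The following is a rough sketch, not the change that was actually submitted: `replace_term` is a hypothetical helper, and it assumes every occurrence has the form `name="contained in"` as in the list above. Reviewing the resulting diff before committing would still be advisable.

```python
from pathlib import Path

OLD = 'name="contained in"'
NEW = 'name="located in"'


def replace_term(root: Path) -> list[Path]:
    """Rewrite OLD to NEW in every .py file under root; return the files changed."""
    changed = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8")
        if OLD in text:
            path.write_text(text.replace(OLD, NEW), encoding="utf-8")
            changed.append(path)
    return changed


if __name__ == "__main__":
    # Assumes the script is run from a checkout that contains the machado/ package.
    for path in replace_term(Path("machado")):
        print("updated", path)
```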

mthang commented 11 months ago

Great! I will work on it and make a PR when it's done.

mthang commented 11 months ago

@azneto Done with the update! https://github.com/lmb-embrapa/machado/pull/363