Expression data loader - Error (feature loading)

martacds commented 5 years ago

Hi! I have uploaded the necessary FASTA and GFF files and published both the genes and mRNA. They show up correctly as Tripal content: Capture3

This is an example of one of the genes: Capture4

I have uploaded the BioSample xmls and am now trying to upload the expression data as a matrix file but it is giving me the following error:

ERROR: The feature, LOC111983044, found in the expression file was not found in the Chado database. Please ensure that the feature has been loaded into the database and that the feature name is both unique and correct.

I have tried with both options as name and unique name. I have also tried two different matrix files where the features are loaded as either LOCXXX or geneXXXX.

What am I missing?

Thank you in advance!

almasaeed2010 commented 5 years ago

Hello @marcsilvaitqb

Did you try selecting a sequence type for your genes? Your example Gene page suggests that you loaded your sequences as gene.

Thanks!

martacds commented 5 years ago

Hi @almasaeed2010

When uploading the genomic fasta file I selected region as the sequence type because my file is composed of scaffolding sequences and in a previous upload that is what was used. But now for the expression data I selected gene as the sequence type, considering that that is the content of my matrix files.

Should they match?

Thank you!

almasaeed2010 commented 5 years ago

I believe they should match. Selecting a sequence type in the expression loader only helps the loader identify the gene and does not alter the type.

martacds commented 5 years ago

Ok, thank you! I'm going to try it with region instead and will update you on the result.

martacds commented 5 years ago

Tried uploading the expression data with the sequence type region (the same as for the FASTA file) and it still shows the same error: ERROR: The feature, LOC111983025, found in the expression file was not found in the Chado database. Please ensure that the feature has been loaded into the database and that the feature name is both unique and correct.

Tried again with the different matrix files and name/unique name. Still the same error.

almasaeed2010 commented 5 years ago

One more thing to check, which option did you select for Name Match Type?

almasaeed2010 commented 5 years ago

Also if you have access to the command line, can you run this?

drush sqlc

Then run this query:

select cvt.name, f.name, f.uniquename from chado.feature f
   inner join chado.cvterm cvt on cvt.cvterm_id = f.type_id
   where f.name = 'LOC111983025' or f.uniquename = 'LOC111983025';

martacds commented 5 years ago

One more thing to check, which option did you select for Name Match Type?

I have tried with both name and unique name.

Then run this query:

This was the result Capture5

almasaeed2010 commented 5 years ago

Ok, so according to the results from the query, you need to select Name for Name Match Type and use gene as the Sequence Type. Did you try that combo?

almasaeed2010 commented 5 years ago

If that still doesn't work, please share your matrix file so we can further look into it.

martacds commented 5 years ago

Yes, I have used that combo.

My matrix files look like this (opened in Excel): With geneXXXX Capture6

With LOCXXX Capture7

almasaeed2010 commented 5 years ago

From what I can tell, you need to specify Name for Name Match Type when using the LOCXXX file and Uniquename when using the geneXXX file. Both files should use gene as the Sequence Type.

I looked at the code and those are the only 2 conditions that must match. Please verify that this is what you did. If it still doesn't work, I'll need to be able to download both the FASTA file and the matrix files to look into it.
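
To make the two conditions concrete, here is a minimal Python sketch of the kind of lookup described above. It is not the module's actual code: the `Feature` class and `find_feature` function are hypothetical names, and the single sample row reuses the values reported for this feature later in the thread (type gene, name LOC111983025, uniquename gene27663).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feature:
    type_name: str    # cvterm name, e.g. "gene"
    name: str         # chado.feature.name
    uniquename: str   # chado.feature.uniquename


# One illustrative row (hypothetical in-memory stand-in for the Chado tables).
FEATURES = [Feature("gene", "LOC111983025", "gene27663")]


def find_feature(identifier: str, sequence_type: str, match_type: str) -> Optional[Feature]:
    """Return the feature whose type and chosen name column both match, else None."""
    if match_type not in ("name", "uniquename"):
        raise ValueError("match_type must be 'name' or 'uniquename'")
    for feature in FEATURES:
        if feature.type_name != sequence_type:
            continue  # Sequence Type must match the feature's type
        if getattr(feature, match_type) == identifier:
            return feature  # Name Match Type selects which column is compared
    return None


print(find_feature("LOC111983025", "gene", "name"))      # LOCXXX file + Name
print(find_feature("gene27663", "gene", "uniquename"))   # geneXXX file + Uniquename
print(find_feature("gene27663", "gene", "name"))         # wrong column
print(find_feature("LOC111983025", "region", "name"))    # wrong sequence type
```

Under these assumptions, only the first two calls find the feature; the last two return `None`, which corresponds to the "not found in the Chado database" situation.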

martacds commented 5 years ago

Yes, I have tried both of those combinations.

The fasta file is here: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/906/115/GCF_002906115.1_CorkOak1.0/GCF_002906115.1_CorkOak1.0_genomic.fna.gz

The gff here: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/906/115/GCF_002906115.1_CorkOak1.0/GCF_002906115.1_CorkOak1.0_genomic.gff.gz

And the matrix file with geneXXX: countsERR490.txt

almasaeed2010 commented 5 years ago

Thanks! I'll try to look into those later today and get back to you.

martacds commented 5 years ago

Thank you so much!

martacds commented 5 years ago

Hi @almasaeed2010 Did you get a chance to check the files?

Thank you

almasaeed2010 commented 5 years ago

Hello @marcsilvaitqb

Sorry it's taking a little longer to debug this issue. The files are large and are taking a while to load into my site. I'll try to trim them down then try again.

Thanks

martacds commented 5 years ago

Oh ok, no problem!

almasaeed2010 commented 5 years ago

I tried trimming the files into a single feature each to make it fast and it worked for me as you can see below:

It's likely that there is a mismatch in the options somewhere. I would try again making sure of all of the following:

That you have selected the correct organism (this is critical)
That you have selected the Matrix format
That you are using the file with geneXXXX for the counts
That you are using gene as the sequence type
And that you are using Uniquename for the name type

So far the only thing we have not verified in the past is the organism so could you run this query?

select o.genus, o.species, cvt.name, f.name, f.uniquename from chado.feature f
   inner join chado.cvterm cvt on cvt.cvterm_id = f.type_id
   inner join chado.organism o on o.organism_id = f.organism_id
   where f.name = 'LOC111983025' or f.uniquename = 'LOC111983025';

martacds commented 5 years ago

I have run the query and the output is: Quercus suber gene LOC111983025 gene27663

almasaeed2010 commented 5 years ago

Thanks for running the query! I hope it works this time when selecting all the parameters

martacds commented 5 years ago

I just ran the whole thing to the end for the first time (I had always cancelled halfway through before), and at the end it shows the following error:

SQLSTATE[23503]: Foreign key violation: 7 ERROR: insert or update on table "element" violates foreign key constraint "element_feature_id_fkey" DETAIL: Key (feature_id)=(0) is not present in table "feature". [site http://default] [TRIPAL ERROR] [TRIPAL_JOB] SQLSTATE[23503]: Foreign key violation: 7 ERROR: insert or update on table "element" violates foreign key constraint "element_feature_id_fkey"DETAIL: Key (feature_id)=(0) is not present in table "feature".

Is this relevant?

(I have also tried again that combination and again the same error of feature not found)
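
For readers unfamiliar with this class of error, the sketch below reproduces a foreign key violation of the same shape. It is only an illustration under stated assumptions: it uses SQLite from Python's standard library as a stand-in for the PostgreSQL database behind Chado, the two tables are heavily simplified versions of `feature` and `element`, and the idea that a failed feature lookup left `feature_id` at 0 is a guess, not something confirmed in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, uniquename TEXT)")
conn.execute(
    "CREATE TABLE element ("
    " element_id INTEGER PRIMARY KEY,"
    " feature_id INTEGER NOT NULL REFERENCES feature (feature_id))"
)
conn.execute("INSERT INTO feature (feature_id, uniquename) VALUES (1, 'gene27663')")


def insert_element(feature_id: int) -> bool:
    """Try to insert an element row; return False if the foreign key check rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO element (feature_id) VALUES (?)", (feature_id,))
        return True
    except sqlite3.IntegrityError:
        return False


print(insert_element(1))  # references an existing feature row
print(insert_element(0))  # no feature row has feature_id 0, so the insert is rejected
```

The point of the sketch is only that an `element` row cannot reference a `feature_id` that does not exist in `feature`, which is what the constraint `element_feature_id_fkey` in the message above enforces.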

almasaeed2010 commented 5 years ago

I've uploaded a new fix to address the error you've encountered. I also adjusted the code that checks if a feature is available. Could you please update the module and try again?

Thanks for your patience!

martacds commented 5 years ago

Sorry for the basic question but I'm not the one that installed the module, so I'm a bit lost. How do I update it?

almasaeed2010 commented 5 years ago

No problem!

You can navigate to the module directory from Drupal's root installation directory: cd sites/all/modules/tripal_analysis_expression, then run git pull && drush updatedb

martacds commented 5 years ago

Thank you so much!

Will update and re-try, and then update you on the outcome.

martacds commented 5 years ago

Hi, This is what shows up now: Capture8

Options: geneXXXX, organism: quercus suber, sequence type: gene, file type: matrix, name match type: unique name

almasaeed2010 commented 5 years ago

Is gene4 the only feature showing this error?

martacds commented 5 years ago

No. I just created a mini version of the matrix file with only gene 3 and gene 4, and they both show the error.

almasaeed2010 commented 5 years ago

I think I found the error this time! Could you please update the code and try again? It looks like it was checking name no matter what you chose for name match type.

martacds commented 5 years ago

I am currently running it to the end with the full matrix file. So far two things show up: Capture9 Capture10

but they don't show up for all genes, it is skipping a few.

As soon as the loader finishes, I'll update again.

almasaeed2010 commented 5 years ago

Can you provide your system information? Just to make sure we are using the same APIs.

Tripal version
Drupal version
PHP version

For Drupal and PHP, you can visit domain.org/admin/reports/status. For Tripal, the version should be listed on the modules page.

martacds commented 5 years ago

Drupal 7.65
PHP 7.2.17-0ubuntu0.18.04.1
Tripal 7.x-3.1

martacds commented 5 years ago

I think that the Tripal Warnings might be due to the fact that some features can be recognized as pseudogenes. After this finishes, I will create the content pseudogenes and see if the same message appears.

EDIT: Adding the Tripal Content Type "pseudogene" did not fix the previous messages.

almasaeed2010 commented 5 years ago

Since the warning is not showing up for all features, let's check these 2 particular genes to make sure they have the right info in the database:

select o.genus, o.species, cvt.name, f.name, f.uniquename from chado.feature f
   inner join chado.cvterm cvt on cvt.cvterm_id = f.type_id
   inner join chado.organism o on o.organism_id = f.organism_id
   where f.name in ('gene837', 'gene846') or f.uniquename in ('gene837', 'gene846');

martacds commented 5 years ago

This is what shows up: Capture12

But I have added the tripal content type pseudogene: Capture13

almasaeed2010 commented 5 years ago

Ok, this makes sense. Since you specified the Sequence Type to be gene, the importer will only look for features that have the type gene and not pseudogene. The loader expects a separate matrix file for each type of feature. I didn't design this module, so I am not entirely sure why this restriction is required, but I can look into it on Wednesday when I have a meeting with the other developers.

martacds commented 5 years ago

So in theory if I upload the exact same matrix file but then select pseudogene as the sequence type, the data will be loaded to those features and output an error for the gene data? I was working on the assumption that the unique name was recognized regardless so I didn't bother dividing the features.

I'm only interested in the genes right now, so I have all I need for now. Thank you so much for your time and patience!

almasaeed2010 commented 5 years ago

Your theory is correct.

And anytime!
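
Since the loader expects one matrix file per feature type, one way to prepare the input is to split the matrix before uploading. The following is a rough Python sketch, not part of the module. It assumes a tab-delimited matrix whose first line is a header and whose first column holds the feature identifier. The function name `split_matrix_by_type`, the sample names, the counts, and the feature-to-type mapping are all hypothetical; in practice the mapping could be built from the output of a query like the ones above.

```python
from collections import defaultdict
from typing import Dict


def split_matrix_by_type(matrix_text: str, feature_types: Dict[str, str]) -> Dict[str, str]:
    """Split a tab-delimited matrix into one matrix per feature type.

    The header line is copied unchanged into every output. Rows whose feature
    is missing from feature_types are collected under "unknown".
    """
    lines = matrix_text.splitlines()
    header, body = lines[0], lines[1:]
    grouped = defaultdict(list)
    for line in body:
        if not line.strip():
            continue  # skip blank lines
        feature = line.split("\t", 1)[0]
        grouped[feature_types.get(feature, "unknown")].append(line)
    return {
        type_name: "\n".join([header] + rows) + "\n"
        for type_name, rows in grouped.items()
    }


# Hypothetical example data: two samples, three features.
matrix = "feature\tsampleA\tsampleB\ngene3\t10\t12\ngene4\t7\t5\ngene837\t0\t1\n"
types = {"gene3": "gene", "gene4": "gene", "gene837": "pseudogene"}  # assumed mapping

for type_name, text in split_matrix_by_type(matrix, types).items():
    print(f"--- {type_name} ---")
    print(text, end="")
```

Each resulting text could then be saved to its own file and loaded with the matching Sequence Type.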

almasaeed2010 commented 5 years ago

Since this issue resulted in fixing a bug, I'll add you as a contributor for bug reporting. Thanks!

@all-contributors add @marcsilvaitqb for bug

allcontributors[bot] commented 5 years ago

@almasaeed2010

I could not determine your intention.

Basic usage: @all-contributors please add @jakebolam for code, doc and infra

For other usages see the documentation

almasaeed2010 commented 5 years ago

Does it want please? 😅

@all-contributors please add @marcsilvaitqb for bug

allcontributors[bot] commented 5 years ago

@almasaeed2010

I've put up a pull request to add @marcsilvaitqb! :tada:

tripal / tripal_analysis_expression

Expression data loader - Error (feature loading) #295