Genes missing Note attribute

svengato commented 8 months ago

In annotation (GFF3) files, what should we do about genes missing the Note attribute? The InterMine build process expects this attribute to exist.

Workaround for experimental ViciaMine: I added ";Note=none" to the attributes column where necessary. However, @adf-ncgr says that these genes are not protein-coding and should therefore not have been included in the annotations.
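For illustration, the ";Note=none" workaround could be sketched in Java roughly as follows. This is a minimal sketch, not the script actually used: the class `NotePatcher` and its methods are hypothetical names, and it assumes standard nine-column, tab-delimited GFF3 lines.

```java
/** Hypothetical helper; not part of InterMine or the datastore tools. */
final class NotePatcher {
    /** Returns true if the GFF3 attributes column contains a Note tag. */
    static boolean hasNote(String attributes) {
        for (String pair : attributes.split(";")) {
            if (pair.trim().startsWith("Note=")) return true;
        }
        return false;
    }

    /** Appends ";Note=none" to gene records lacking Note; all other lines pass through unchanged. */
    static String patchLine(String line) {
        if (line.startsWith("#")) return line;            // comments and directives
        String[] cols = line.split("\t", -1);
        if (cols.length != 9 || !cols[2].equals("gene") || hasNote(cols[8])) return line;
        // Drop a trailing ';' so we do not produce ";;".
        String attrs = cols[8].endsWith(";")
                ? cols[8].substring(0, cols[8].length() - 1)
                : cols[8];
        cols[8] = (attrs.isEmpty() || attrs.equals(".")) ? "Note=none" : attrs + ";Note=none";
        return String.join("\t", cols);
    }
}
```

Applying `patchLine` to every line of the uncompressed gff3 leaves non-gene features and genes that already carry a Note untouched.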

svengato commented 8 months ago

Clarification: added to annotations on lupini-mines, not in the data store.

svengato commented 5 months ago

I am seeing this problem in the new (HEN17-A07) tripr annotations as well. Should we clean up the Vicia and Trifolium annotations before proceeding further?

adf-ncgr commented 5 months ago

I think the new tripr and also vicvi annotations may not have been subjected to AHRD yet. Technically they probably shouldn't have been put into the public area of the datastore before that happened, but it's been a long pause for them while @StevenCannon-USDA worked out the process for handling NCBI files, so likely one of us just forgot about it. I'll try to sort that out, but yes, holding off for now is advisable.

adf-ncgr commented 5 months ago

Actually, it looks like I did run AHRD, but never copied the files over to the datastore proper. However, there is still at least one loose end (gene family assignment) before InterMine loading can take place. Running that now; will let you know when things are ready (should be later today, but that need not chain you to your keyboard).

adf-ncgr commented 5 months ago

OK, I think all the vicia annotations and the new tripr annotations should be ready for the mines, but let me know if you run into anything further.

svengato commented 5 months ago

It looks like the Hedin2.gnm1.ann1.PTNK annotations did get rsynced to falafel, but I see the same problem.

Retrieving Hedin2.gnm1.ann1.PTNK in a tgt items database
[convertFile] --------------------------------------------------------------------------------
[convertFile] ## Validating vicfa collection Hedin2.gnm1.ann1.PTNK
[convertFile]  - vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main.gff3.gz
[convertFile]  x vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main.gff3.gz 14720 gene record Note attributes are missing GO terms.
[convertFile]  - vicfa.Hedin2.gnm1.ann1.PTNK.protein.faa.gz
[convertFile]  - vicfa.Hedin2.gnm1.ann1.PTNK.protein_primary.faa.gz
[convertFile]  - vicfa.Hedin2.gnm1.ann1.PTNK.cds.fna.gz
[convertFile]  - vicfa.Hedin2.gnm1.ann1.PTNK.cds_primary.fna.gz
[convertFile]  - vicfa.Hedin2.gnm1.ann1.PTNK.mrna.fna.gz
[convertFile]  - vicfa.Hedin2.gnm1.ann1.PTNK.mrna_primary.fna.gz
[convertFile]  - vicfa.Hedin2.gnm1.ann1.PTNK.iprscan.gff3.gz
[convertFile]  - vicfa.Hedin2.gnm1.ann1.PTNK.legfed_v1_0.M65K.gfa.tsv.gz
[convertFile]  x optional phytozome_10_2.HFNR.gfa.tsv.gz file is not present.
[convertFile] ## Processing README.Hedin2.gnm1.ann1.PTNK.yml
[convertFile] ## Processing README.Hedin2.gnm1.06GS.yml
[convertFile]  x skipping vicfa.Hedin2.gnm1.ann1.PTNK.cds.bed.gz
[convertFile] ## Processing vicfa.Hedin2.gnm1.ann1.PTNK.cds.fna.gz
[convertFile] ## Processing vicfa.Hedin2.gnm1.ann1.PTNK.cds_primary.fna.gz
[convertFile]  x skipping vicfa.Hedin2.gnm1.ann1.PTNK.featid_map.tsv.gz
[convertFile]  x skipping vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main.bed.gz
[convertFile] ## Processing vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main.gff3.gz

> Task :dbmodel:integrateMultipleSources FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':dbmodel:integrateMultipleSources'.
> java.lang.RuntimeException: Exon vicfa.Hedin2.gnm1.ann1.1g215080.2-exon-1 parent mRNA vicfa.Hedin2.gnm1.ann1.1g215080.2 has not yet been loaded. Is the GFF sorted?
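
One way to look for this kind of problem before a build is to scan the gff3 and report every feature whose Parent has not appeared on an earlier line. The following is a minimal sketch, assuming nine-column GFF3 lines; `GffOrderCheck` is a hypothetical name and is not how InterMine itself performs the check. A reported feature may have a parent that appears later in the file or a parent that is absent altogether.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical pre-build check; not InterMine code. */
final class GffOrderCheck {
    /** Extracts the value of one tag from a GFF3 attributes column, or null if absent. */
    static String attr(String attributes, String tag) {
        for (String pair : attributes.split(";")) {
            String p = pair.trim();
            if (p.startsWith(tag + "=")) return p.substring(tag.length() + 1);
        }
        return null;
    }

    /** Returns the IDs of features whose Parent was not seen on an earlier line. */
    static List<String> orphans(List<String> lines) {
        Set<String> seen = new HashSet<>();
        List<String> bad = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] cols = line.split("\t", -1);
            if (cols.length != 9) continue;
            String id = attr(cols[8], "ID");
            String parents = attr(cols[8], "Parent");
            if (parents != null) {
                // GFF3 allows several comma-separated parents.
                for (String parent : parents.split(",")) {
                    if (!seen.contains(parent)) {
                        bad.add(id != null ? id : line);
                        break;
                    }
                }
            }
            if (id != null) seen.add(id);
        }
        return bad;
    }
}
```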

adf-ncgr commented 5 months ago

OK, looks like I didn't get things quite right when it comes to genes having both coding and non-coding children. Will fix

adf-ncgr commented 5 months ago

OK, I think it should be resolved now, but let me know if not.

svengato commented 5 months ago

The vicvi part of the build failed:

[...]
[convertFile] ## Processing vicvi.HV-30.gnm1.ann1.6WFF.protein_primary.faa.gz

> Task :dbmodel:integrateMultipleSources FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':dbmodel:integrateMultipleSources'.
> java.lang.NullPointerException

Not much information, but I will look into it.

There were also many instances of the following error (for different ctgs). I can probably just add "ctg" to the supercontig_prefix.

[convertFile] ### feature ignored on sequence vicvi.HV-30.gnm1.ctg.15230_1_1 because not recognized as chromosome or supercontig.
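
To illustrate why adding "ctg" should help, here is a sketch of prefix-based recognition. How the loader actually applies the supercontig prefixes is not shown in this thread, so this is an assumption: the part of the sequence ID after the ".gnmN." component is compared against a list of prefixes. The class `SequenceKind` and its methods are hypothetical.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of prefix matching; not the loader's actual logic. */
final class SequenceKind {
    // Matches the genome-version component of a datastore ID, e.g. ".gnm1."
    private static final Pattern GNM = Pattern.compile("\\.gnm\\d+\\.");

    /** Returns the part of the ID after ".gnmN.", or the whole ID if that component is absent. */
    static String localName(String seqId) {
        Matcher m = GNM.matcher(seqId);
        return m.find() ? seqId.substring(m.end()) : seqId;
    }

    /** True if the local name starts with any of the configured supercontig prefixes. */
    static boolean isSupercontig(String seqId, List<String> prefixes) {
        String name = localName(seqId);
        for (String prefix : prefixes) {
            if (name.startsWith(prefix)) return true;
        }
        return false;
    }
}
```

Under that assumption, `vicvi.HV-30.gnm1.ctg.15230_1_1` has the local name `ctg.15230_1_1`, which is recognized only once "ctg" is in the prefix list.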

svengato commented 5 months ago

I can probably just add "ctg" to the supercontig_prefix[es].

I did not think the NullPointerException was due to the ctgs, but now it gets farther (I am building a vicvi-only mine to save time).

svengato commented 4 months ago

I tried to build LensMine 5.1.0.4 and got the same error, which suggests that we need to eliminate the non-coding genes for all species.

[convertFile] ## Validating lencu collection CDC_Redberry.gnm2.ann1.5FB4
[convertFile]  - lencu.CDC_Redberry.gnm2.ann1.5FB4.gene_models_main.gff3.gz
[convertFile] ## INVALID: lencu.CDC_Redberry.gnm2.ann1.5FB4.gene_models_main.gff3.gz 637 gene records are missing the Note attribute.
[convertFile]  x lencu.CDC_Redberry.gnm2.ann1.5FB4.gene_models_main.gff3.gz 16164 gene record Note attributes are missing GO terms.
[convertFile]  - lencu.CDC_Redberry.gnm2.ann1.5FB4.protein.faa.gz
[convertFile]  - lencu.CDC_Redberry.gnm2.ann1.5FB4.cds.fna.gz
[convertFile]  - lencu.CDC_Redberry.gnm2.ann1.5FB4.mrna.fna.gz
[convertFile]  - lencu.CDC_Redberry.gnm2.ann1.5FB4.iprscan.gff3.gz
[convertFile]  - lencu.CDC_Redberry.gnm2.ann1.5FB4.legfed_v1_0.M65K.gfa.tsv.gz
[convertFile]  x optional phytozome_10_2.HFNR.gfa.tsv.gz file is not present.

> Task :dbmodel:integrateMultipleSources FAILED

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':dbmodel:integrateMultipleSources'.
> java.lang.RuntimeException: Collection /falafel/legumeinfo/rsyncd_datastore/v2/Lens/culinaris/annotations/CDC_Redberry.gnm2.ann1.5FB4 does not pass validation.

svengato commented 4 months ago

This validation happens in

/home/legumista/java/ncgr/datastore/src/main/java/org/ncgr/datastore/validation/AnnotationCollectionValidator.java

or originally from the ncgr/java repository,

https://github.com/ncgr/java/blob/main/datastore/src/main/java/org/ncgr/datastore/validation/AnnotationCollectionValidator.java

svengato commented 4 months ago

For example, why is the limit of genes without Notes set to 100?

            if (genesWithoutNotes > 100) {
                printError(file.getName() + " " + genesWithoutNotes + " gene records are missing the Note attribute.");
            } else if (genesWithoutNotes > 0) {
                printWarning(file.getName() + " " + genesWithoutNotes + " gene records are missing the Note attribute.");
            }

adf-ncgr commented 4 months ago

@svengato 100 was a fairly arbitrary choice of number that Sam and I agreed to at the time, allowing some modest level of missingness around the order of magnitude we might expect if it was due to the presence of non-coding genes, but still intended to fail validation if copious levels of missingness suggested some more dramatic cause (like someone had forgotten to run AHRD). Since we seem to have more or less decided to segregate non-coding genes, we could probably return to a 0-tolerance policy. But I do still need to apply the segregation procedure to this (and maybe other) data collection. Will let you know when try-againable.

adf-ncgr commented 4 months ago

Hmm, looks like I was wrong (sort of) about the issue with the annotation file in question. There are no explicitly non-coding elements, but some genes appear to be missing Note attributes because there are no proteins for them. That is, they have children designated as mRNA but those children have not given their parent genes any CDS grandchildren. No wonder they say lentils are never boring! Not sure what the best thing to do is, but I guess this sort of thing might argue for making the validation limit something we can override rather than having it be purely hard-coded.
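
The case described above (genes with mRNA children but no CDS grandchildren) can be listed with a short scan of the gff3. This is a minimal sketch, assuming each mRNA has a single parent gene and that the feature types are spelled exactly `mRNA` and `CDS`; `CdsLessGenes` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical diagnostic; not part of the validator. */
final class CdsLessGenes {
    /** Extracts the value of one tag from a GFF3 attributes column, or null if absent. */
    static String attr(String attributes, String tag) {
        for (String pair : attributes.split(";")) {
            String p = pair.trim();
            if (p.startsWith(tag + "=")) return p.substring(tag.length() + 1);
        }
        return null;
    }

    /** Returns IDs of genes that have at least one mRNA child but no CDS under any of their mRNAs. */
    static Set<String> find(List<String> gffLines) {
        Map<String, String> mrnaToGene = new HashMap<>();
        Set<String> genesWithMrna = new LinkedHashSet<>();
        Set<String> mrnasWithCds = new HashSet<>();
        for (String line : gffLines) {
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] cols = line.split("\t", -1);
            if (cols.length != 9) continue;
            String id = attr(cols[8], "ID");
            String parent = attr(cols[8], "Parent");
            if (parent == null) continue;
            if (cols[2].equals("mRNA") && id != null) {
                mrnaToGene.put(id, parent);       // assumes one parent gene per mRNA
                genesWithMrna.add(parent);
            } else if (cols[2].equals("CDS")) {
                for (String p : parent.split(",")) mrnasWithCds.add(p);
            }
        }
        // Start from all genes with mRNAs and remove those with a CDS grandchild.
        Set<String> result = new LinkedHashSet<>(genesWithMrna);
        for (String mrna : mrnasWithCds) {
            result.remove(mrnaToGene.get(mrna));
        }
        return result;
    }
}
```

Because the parent links are collected first and resolved afterwards, the result does not depend on the order of lines in the file.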

svengato commented 4 months ago

Would it make sense to leave the data store files as they are, but filter out non-coding genes in the integration code?

adf-ncgr commented 4 months ago

There's really nothing wrong with having non-coding genes in the mines per se; I think it would make more sense to just have an option that tells the validator to "chillax" (this being a cool season legume and all). The only reason to have the check in the first place is to guard against major snafus, but minor snafus are what this business is all about.

svengato commented 4 months ago

> But I do still need to apply the segregation procedure to this (and maybe other) data collection. Will let you know when try-againable

For TrifoliumMine, I think the T. pratense MilvusB annotations and T. subterraneum Daliak annotations still need it.

ViciaMine: V. faba Tiffany annotations have no 'noncoding' file, but maybe there were no noncoding genes? (I was able to rebuild the ViciaMine database after your updates.)

svengato commented 4 months ago

I prepared some more mines for the 5.1.0.4 upgrade. LensMine and LupinusMine have the same problem with noncoding genes; CajanusMine appears not to, and is still running the 'integrate' step.

svengato commented 3 months ago

Working on adding an optional maxGenesWithoutNotes property in each mine's project.xml to address this issue. I propose:

- If it is missing, use the default value of 100.
- If it is a positive integer, use that.
- If it is -1 (or any negative integer), treat that as no limit.
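
The proposed rules can be sketched as a small helper. This is an illustration only, not the code that was eventually committed: `NoteLimit` and its methods are hypothetical names, the raw string stands for whatever the property lookup returns (null if unset), and treating 0 as zero tolerance is an assumption, since the proposal above only mentions positive and negative values.

```java
/** Hypothetical sketch of the proposed maxGenesWithoutNotes semantics. */
final class NoteLimit {
    static final int DEFAULT_MAX = 100;

    /** Returns the effective limit; Integer.MAX_VALUE stands for "no limit". */
    static int resolve(String raw) {
        if (raw == null || raw.trim().isEmpty()) return DEFAULT_MAX;  // property missing
        int value = Integer.parseInt(raw.trim());                     // throws NumberFormatException on bad input
        return value < 0 ? Integer.MAX_VALUE : value;                 // negative means no limit
    }

    /** True if the count of genes without Notes exceeds the effective limit. */
    static boolean failsValidation(int genesWithoutNotes, String raw) {
        return genesWithoutNotes > resolve(raw);
    }
}
```

With these rules, the 637 Note-less genes in the lentil collection fail under the default but pass when the property is set to -1 or to any value of at least 637.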

svengato commented 3 months ago

I found two ways to do this:

1. Add it to gradle.properties as a Java system property. (This does not work for either .intermine/<mine-name>.properties or project.xml.)
```
## maximum number of genes without Notes
systemProp.maxGenesWithoutNotes=-1
```

2. Add it on the command line, like

./gradlew integrate --stacktrace -DmaxGenesWithoutNotes=-1

I prefer the second way as you can decide to override it on the fly instead of changing a file.

svengato commented 3 months ago

Using the second option for now - commit https://github.com/ncgr/java/commit/064ae8516064110e31c7ea7057732c0cd8970cf0.

legumeinfo / mine-issues

Genes missing Note attribute #152