Open svengato opened 8 months ago
Clarification: added to annotations on lupini-mines, not in the data store.
I am seeing this problem in the new (HEN17-A07) tripr annotations as well. Should we clean up the Vicia and Trifolium annotations before proceeding further?
I think the new tripr and also vicvi annotations may not have been subjected to AHRD yet. Technically they probably shouldn't have been put into the public area of the datastore before that happened, but it's been a long pause for them while @StevenCannon-USDA worked out the process for handling NCBI files, so likely one of us just forgot about it. I'll try to sort that out, but yes, holding off for now is advisable.
Actually, it looks like I did run AHRD, but never copied the files over to the datastore proper. However, there is still at least one loose end (gene family assignment) before intermine loading can take place. I'm running that now and will let you know when things are ready (should be later today, but that need not chain you to your keyboard).
OK, I think all the vicia annotations and the new tripr annotations should be ready for the mines, but let me know if you run into anything further.
It looks like the Hedin2.gnm1.ann1.PTNK annotations did get rsynced to falafel, but I see the same problem.
Retrieving Hedin2.gnm1.ann1.PTNK in a tgt items database
[convertFile] --------------------------------------------------------------------------------
[convertFile] ## Validating vicfa collection Hedin2.gnm1.ann1.PTNK
[convertFile] - vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main.gff3.gz
[convertFile] x vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main.gff3.gz 14720 gene record Note attributes are missing GO terms.
[convertFile] - vicfa.Hedin2.gnm1.ann1.PTNK.protein.faa.gz
[convertFile] - vicfa.Hedin2.gnm1.ann1.PTNK.protein_primary.faa.gz
[convertFile] - vicfa.Hedin2.gnm1.ann1.PTNK.cds.fna.gz
[convertFile] - vicfa.Hedin2.gnm1.ann1.PTNK.cds_primary.fna.gz
[convertFile] - vicfa.Hedin2.gnm1.ann1.PTNK.mrna.fna.gz
[convertFile] - vicfa.Hedin2.gnm1.ann1.PTNK.mrna_primary.fna.gz
[convertFile] - vicfa.Hedin2.gnm1.ann1.PTNK.iprscan.gff3.gz
[convertFile] - vicfa.Hedin2.gnm1.ann1.PTNK.legfed_v1_0.M65K.gfa.tsv.gz
[convertFile] x optional phytozome_10_2.HFNR.gfa.tsv.gz file is not present.
[convertFile] ## Processing README.Hedin2.gnm1.ann1.PTNK.yml
[convertFile] ## Processing README.Hedin2.gnm1.06GS.yml
[convertFile] x skipping vicfa.Hedin2.gnm1.ann1.PTNK.cds.bed.gz
[convertFile] ## Processing vicfa.Hedin2.gnm1.ann1.PTNK.cds.fna.gz
[convertFile] ## Processing vicfa.Hedin2.gnm1.ann1.PTNK.cds_primary.fna.gz
[convertFile] x skipping vicfa.Hedin2.gnm1.ann1.PTNK.featid_map.tsv.gz
[convertFile] x skipping vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main.bed.gz
[convertFile] ## Processing vicfa.Hedin2.gnm1.ann1.PTNK.gene_models_main.gff3.gz
> Task :dbmodel:integrateMultipleSources FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':dbmodel:integrateMultipleSources'.
> java.lang.RuntimeException: Exon vicfa.Hedin2.gnm1.ann1.1g215080.2-exon-1 parent mRNA vicfa.Hedin2.gnm1.ann1.1g215080.2 has not yet been loaded. Is the GFF sorted?
OK, it looks like I didn't get things quite right for genes that have both coding and non-coding children. Will fix.
OK, I think it should be resolved now, but let me know if not.
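For anyone wanting to check a GFF before kicking off a build, here is a minimal sketch of the "parent before child" ordering that the exception above seems to require. The class and method names are hypothetical (this is not code from InterMine or ncgr/java), and it assumes standard tab-separated GFF3 with `ID` and `Parent` attributes in the ninth column.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of InterMine or ncgr/java. */
final class GffParentOrderCheck {

    /** Returns the value of one attribute (e.g. "ID") from a GFF3 attributes column, or null. */
    static String attribute(String attributes, String key) {
        for (String pair : attributes.split(";")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].trim().equals(key)) {
                return kv[1].trim();
            }
        }
        return null;
    }

    /**
     * Returns the ID (or the whole line, if there is no ID) of the first feature
     * whose Parent has not appeared on an earlier line, or null if every parent
     * precedes its children.
     */
    static String firstOrphan(List<String> gffLines) {
        Set<String> seenIds = new HashSet<>();
        for (String line : gffLines) {
            if (line.isEmpty() || line.startsWith("#")) continue; // skip comments/directives
            String[] cols = line.split("\t");
            if (cols.length < 9) continue;
            String id = attribute(cols[8], "ID");
            String parents = attribute(cols[8], "Parent");
            if (parents != null) {
                for (String parent : parents.split(",")) { // Parent may list several IDs
                    if (!seenIds.contains(parent)) {
                        return id != null ? id : line;
                    }
                }
            }
            if (id != null) seenIds.add(id);
        }
        return null;
    }
}
```

Feeding it the decompressed lines of a gene_models_main.gff3 file would return the first offending feature ID, or null if the ordering looks fine.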
The vicvi part of the build failed:
[...]
[convertFile] ## Processing vicvi.HV-30.gnm1.ann1.6WFF.protein_primary.faa.gz
> Task :dbmodel:integrateMultipleSources FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':dbmodel:integrateMultipleSources'.
> java.lang.NullPointerException
Not much information, but I will look into it.
There were also many instances of the following error (for different ctgs):
[convertFile] ### feature ignored on sequence vicvi.HV-30.gnm1.ctg.15230_1_1 because not recognized as chromosome or supercontig.
I can probably just add "ctg" to the supercontig_prefix[es].
I did not think the NullPointerException was due to the ctgs, but now it gets farther (I am building a vicvi-only mine to save time).
I tried to build LensMine 5.1.0.4 and got the same error, which suggests that we need to eliminate the non-coding genes for all species.
[convertFile] ## Validating lencu collection CDC_Redberry.gnm2.ann1.5FB4
[convertFile] - lencu.CDC_Redberry.gnm2.ann1.5FB4.gene_models_main.gff3.gz
[convertFile] ## INVALID: lencu.CDC_Redberry.gnm2.ann1.5FB4.gene_models_main.gff3.gz 637 gene records are missing the Note attribute.
[convertFile] x lencu.CDC_Redberry.gnm2.ann1.5FB4.gene_models_main.gff3.gz 16164 gene record Note attributes are missing GO terms.
[convertFile] - lencu.CDC_Redberry.gnm2.ann1.5FB4.protein.faa.gz
[convertFile] - lencu.CDC_Redberry.gnm2.ann1.5FB4.cds.fna.gz
[convertFile] - lencu.CDC_Redberry.gnm2.ann1.5FB4.mrna.fna.gz
[convertFile] - lencu.CDC_Redberry.gnm2.ann1.5FB4.iprscan.gff3.gz
[convertFile] - lencu.CDC_Redberry.gnm2.ann1.5FB4.legfed_v1_0.M65K.gfa.tsv.gz
[convertFile] x optional phytozome_10_2.HFNR.gfa.tsv.gz file is not present.
> Task :dbmodel:integrateMultipleSources FAILED
FAILURE: Build failed with an exception.
* What went wrong:
Execution failed for task ':dbmodel:integrateMultipleSources'.
> java.lang.RuntimeException: Collection /falafel/legumeinfo/rsyncd_datastore/v2/Lens/culinaris/annotations/CDC_Redberry.gnm2.ann1.5FB4 does not pass validation.
This validation happens in /home/legumista/java/ncgr/datastore/src/main/java/org/ncgr/datastore/validation/AnnotationCollectionValidator.java, or originally from the ncgr/java repository, https://github.com/ncgr/java/blob/main/datastore/src/main/java/org/ncgr/datastore/validation/AnnotationCollectionValidator.java
For example, why is the limit of genes without Notes set to 100?
    if (genesWithoutNotes > 100) {
        printError(file.getName() + " " + genesWithoutNotes + " gene records are missing the Note attribute.");
    } else if (genesWithoutNotes > 0) {
        printWarning(file.getName() + " " + genesWithoutNotes + " gene records are missing the Note attribute.");
    }
@svengato 100 was a fairly arbitrary choice of number that Sam and I agreed to at the time, allowing some modest level of missingness around the order of magnitude we might expect if it was due to the presence of non-coding genes, but still intended to fail validation if copious levels of missingness suggested some more dramatic cause (like someone had forgotten to run AHRD). Since we seem to have more or less decided to segregate non-coding genes, we could probably return to a 0-tolerance policy. But I do still need to apply the segregation procedure to this (and maybe other) data collection. Will let you know when try-againable
Hmm, looks like I was wrong (sort of) about the issue with the annotation file in question. There are no explicitly non-coding elements, but some genes appear to be missing Note attributes because there are no proteins for them. That is, they have children designated as mRNA but those children have not given their parent genes any CDS grandchildren. No wonder they say lentils are never boring! Not sure what the best thing to do is, but I guess this sort of thing might argue for making the validation limit something we can override rather than having it be purely hard-coded.
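To make the "mRNA children but no CDS grandchildren" case concrete, here is a rough sketch of how such genes could be listed from a GFF3 file. The class name is hypothetical and this is not the validator's code; it assumes feature types are literally `gene`, `mRNA` and `CDS`, and it will also report genes that have no mRNA children at all.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of ncgr/java. */
final class GenesWithoutCds {

    /** Returns the value of one attribute from a GFF3 attributes column, or null. */
    static String attribute(String attributes, String key) {
        for (String pair : attributes.split(";")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].trim().equals(key)) return kv[1].trim();
        }
        return null;
    }

    /** Returns IDs of genes for which no CDS is reachable through an mRNA child. */
    static Set<String> genesLackingCds(List<String> gffLines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : gffLines) {
            if (line.isEmpty() || line.startsWith("#")) continue;
            String[] cols = line.split("\t");
            if (cols.length >= 9) rows.add(cols);
        }
        // First pass: collect gene IDs and map each mRNA to its parent gene.
        Set<String> lacking = new LinkedHashSet<>();
        Map<String, String> mrnaToGene = new HashMap<>();
        for (String[] cols : rows) {
            String id = attribute(cols[8], "ID");
            if (id == null) continue;
            if (cols[2].equals("gene")) {
                lacking.add(id);
            } else if (cols[2].equals("mRNA")) {
                String parent = attribute(cols[8], "Parent");
                if (parent != null) mrnaToGene.put(id, parent);
            }
        }
        // Second pass: every CDS clears the gene that owns its parent mRNA.
        for (String[] cols : rows) {
            if (!cols[2].equals("CDS")) continue;
            String parents = attribute(cols[8], "Parent");
            if (parents == null) continue;
            for (String mrna : parents.split(",")) {
                String gene = mrnaToGene.get(mrna);
                if (gene != null) lacking.remove(gene);
            }
        }
        return lacking;
    }
}
```

Two passes are used so the result does not depend on the file being sorted.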
Would it make sense to leave the data store files as they are, but filter out non-coding genes in the integration code?
There's really nothing wrong with having non-coding genes in the mines per se. I think it would make more sense to just have an option that tells the validator to "chillax" (this being a cool season legume and all). The only reason to have the check in the first place is to guard against major snafus, but minor snafus are what this business is all about.
But I do still need to apply the segregation procedure to this (and maybe other) data collection. Will let you know when try-againable
For TrifoliumMine, I think the T. pratense MilvusB annotations and T. subterraneum Daliak annotations still need it.
ViciaMine: V. faba Tiffany annotations have no 'noncoding' file, but maybe there were no noncoding genes? (I was able to rebuild the ViciaMine database after your updates.)
I prepared some more mines for the 5.1.0.4 upgrade. LensMine and LupinusMine have the same problem with noncoding genes; CajanusMine appears not to, and is still running the 'integrate' step.
Working on adding an optional maxGenesWithoutNotes property in each mine's project.xml to address this issue. I propose:
I found two ways to do this:
Add it to gradle.properties as a Java system property. (This does not work for either .intermine/<mine-name>.properties or project.xml.)
## maximum number of genes without Notes
systemProp.maxGenesWithoutNotes=-1
Add it on the command line, like
./gradlew integrate --stacktrace -DmaxGenesWithoutNotes=-1
I prefer the second way as you can decide to override it on the fly instead of changing a file.
Using the second option for now - commit https://github.com/ncgr/java/commit/064ae8516064110e31c7ea7057732c0cd8970cf0.
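For readers who don't want to open the commit, here is a minimal sketch of how a validator could pick up such a system property. It is not the code from the commit: the class and method names are hypothetical, the default of 100 simply mirrors the hard-coded limit quoted earlier, and treating a negative value as "no limit" is my assumption about what -1 is meant to do.

```java
/** Hypothetical helper, not the actual ncgr/java implementation. */
final class NoteThreshold {

    /** Mirrors the hard-coded limit quoted earlier in this thread. */
    static final int DEFAULT_MAX = 100;

    /**
     * Reads -DmaxGenesWithoutNotes=N from the JVM system properties,
     * falling back to the default when it is unset or not an integer.
     */
    static int maxGenesWithoutNotes() {
        return Integer.getInteger("maxGenesWithoutNotes", DEFAULT_MAX);
    }

    /**
     * Returns true when validation should fail.
     * Assumption: a negative limit disables the check entirely.
     */
    static boolean exceedsLimit(int genesWithoutNotes, int max) {
        return max >= 0 && genesWithoutNotes > max;
    }

    public static void main(String[] args) {
        int max = maxGenesWithoutNotes();
        int missing = 637; // count reported for the lencu collection above
        System.out.println(exceedsLimit(missing, max) ? "INVALID" : "OK");
    }
}
```

Under these assumptions the property can come from either route described above, since both end up as a JVM system property.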
In annotations (gff3) files, what should we do about genes missing the Note attribute? The InterMine build process expects this attribute to exist.
Workaround for experimental ViciaMine: I added ";Note=none" to the attributes column where necessary. However, @adf-ncgr says that these genes are not protein-coding and should therefore not have been included in the annotations.
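A sketch of what that ";Note=none" workaround amounts to, one GFF3 line at a time. The class name is hypothetical and this is only an illustration of the edit described above, assuming tab-separated GFF3 with the feature type in column 3 and attributes in column 9.

```java
/** Hypothetical helper illustrating the ";Note=none" workaround. */
final class NotePlaceholder {

    /**
     * Returns the line with "Note=none" appended to the attributes column
     * if it is a gene record lacking a Note attribute; otherwise returns
     * the line unchanged.
     */
    static String withPlaceholderNote(String gffLine) {
        if (gffLine.isEmpty() || gffLine.startsWith("#")) return gffLine;
        String[] cols = gffLine.split("\t", -1); // -1 keeps trailing empty columns
        if (cols.length < 9 || !cols[2].equals("gene")) return gffLine;
        for (String pair : cols[8].split(";")) {
            if (pair.trim().startsWith("Note=")) return gffLine; // already has a Note
        }
        String attrs = cols[8];
        if (attrs.isEmpty() || attrs.equals(".")) {
            cols[8] = "Note=none";
        } else if (attrs.endsWith(";")) {
            cols[8] = attrs + "Note=none";
        } else {
            cols[8] = attrs + ";Note=none";
        }
        return String.join("\t", cols);
    }
}
```

Only gene records are touched; mRNA, exon and CDS lines pass through as they are.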