[Open] StevenCannon-USDA opened 5 months ago
This one is back in play, following our discussion about handling haplotype-resolved assemblies.
@StevenCannon-USDA should have the AHRDs on these two completed soon and will move them from the annex to the main datastore. My preference would be to move them both there, since it seems like it would make sense to include them both in at least some (if not all) downstream systems. But I wanted to confirm with you, since I think you were originally planning to leave secondary haplotypes in the annex.

Also, one very minor note: the procedure you're using for the upstream processing is producing uncompressed GFF3 for the gene_models_main files, although they have the .gz suffix. Not really a problem, since we have to add the AHRD results and redo compression/indexing anyway, but it is a bit confusing when gunzip complains...
"move them both there since it seems like it would make sense to include them both in at least some (if not all) downstream systems."

I agree now that moving them both to the main Data Store is best.
Thanks for the alert about the uncompressed GFF3s. I suspect that was due to some additional manual steps I took when the automated compression failed, possibly because of an interrupted session.
OK, the data-content tasks (AHRD/BUSCO/gfa) should be complete, and I've moved the folders into the main datastore. Downstream steps will proceed as time permits, but if there are any you consider higher priority than others, let me know.
Regarding the compression: it was definitely an issue on both haplotypes, and I feel like I've seen it before, though I'm not 100% sure about that. In any case, if I see it again I'll let you know.
OK, thank you.
I'll also investigate the compression issue -- at least next time I run the process.
The script responsible should be
/usr/local/www/data/datastore-specifications/scripts/compress_and_index.sh
and the code in question is:
for file in $filepath/*.f?a $filepath/*.gff3 $filepath/*tsv $filepath/*bed; do
  if test -f "$file"; then
    echo "Compressing $file"
    bgzip -l9 "$file" &   # compress at maximum level, in the background
  fi
done
wait   # block until all background bgzip jobs have finished
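(Side note: bgzip is used here rather than plain gzip because it writes BGZF, a gzip-compatible format that supports the block-level random access tabix indexing needs. Presumably the indexing half of compress_and_index.sh then runs something along these lines, though that part isn't quoted above; the filename is illustrative:)

```bash
# Hypothetical indexing step, not quoted from the actual script.
# tabix requires BGZF-compressed, coordinate-sorted input.
tabix -p gff chafa.ISC494698.gnm1.ann1.gene_models_main.gff3.gz
```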
Well, that looks pretty straightforward. But now that I think about it some more, I don't think an interrupted session would explain the observed behavior, which is as if the original file were simply renamed with a .gz suffix. Is it possible that there's something else that just names it with a .gz extension (in which case the code above wouldn't even see it there)?
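One quick way to check for that (a minimal sketch; the directory path is illustrative) is to test each .gz file for the gzip magic bytes:

```bash
# List *.gz files that lack the gzip magic bytes (1f 8b),
# i.e. files that were merely renamed with a .gz suffix.
for f in /path/to/collection/*.gz; do
  magic=$(head -c 2 "$f" | od -An -tx1 | tr -d ' \n')
  if [ "$magic" != "1f8b" ]; then
    echo "not actually gzipped: $f"
  fi
done
```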
"Is it possible that there's something else that just names it with a gz extension"
Helpful suggestion/clue. You are right.
Here's the source of the problem. In the ds_souschef.pl configs chafa.ISC494698.gnm1_hap2.ann1.yml and chafa.ISC494698.gnm1.ann1.yml, the "to" suffix was given as gene_models_main.gff3.gz, but it should have been just gene_models_main.gff3, since the output is not gzipped by ds_souschef.pl.
-
  from: gene_strip.gff3.gz
  to: gene_models_main.gff3.gz
  description: "Gene models - main"
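With the fix described above, the stanza becomes:

```yaml
-
  from: gene_strip.gff3.gz
  to: gene_models_main.gff3
  description: "Gene models - main"
```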
I'll plan to add checks for this in ds_souschef.pl once I've finished some other tasks.
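In the meantime, something along these lines could flag the problem in existing configs (a minimal sketch, not actual ds_souschef.pl code; the config glob is illustrative):

```bash
# Warn about config stanzas whose "to:" filename ends in .gz,
# since ds_souschef.pl writes uncompressed output.
for cfg in /path/to/configs/*.yml; do
  if grep -nE '^\s*to:\s*\S+\.gz\s*$' "$cfg"; then
    echo "WARNING: $cfg maps output to a .gz name, but ds_souschef.pl does not gzip it"
  fi
done
```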
Main steps for adding new genome and annotation collections
Genus/species/collection names:
Haplotype 1:
Haplotype 2:
  Chamaecrista/fasciculata/genomes/ISC494698.gnm1_hap2.G6BY
  Chamaecrista/fasciculata/annotations/ISC494698.gnm1_hap2.ann1.WXZF
[x] Add collection(s) to the Data Store, including commits to datastore-metadata
[x] Validate the README(s)
[x] Update about_this_collection.yml
[x] Calculate AHRD functional annotations
[x] Calculate gene family assignments (.gfa)
[N/A] Add to pan-gene set
[x] Load relevant mine
[ ] Add BLAST targets
[x] Incorporate into GCV
[ ] Update the Jekyll collections listing
[x] Update browser configs
[x] Run BUSCO
[x] Update DSCensor
[ ] Add LINKOUTS to datastore, refresh linkout service