legumeinfo / datastore-issues

mostly for issues pertaining to the content of the legumeinfo datastore; may also relate to characteristics of its user interface or managing the mirroring process to the legfed instance


New genome and annotations for Chamaecrista fasciculata (two haplotypes) #208

Open StevenCannon-USDA opened 5 months ago

StevenCannon-USDA commented 5 months ago

Main steps for adding new genome and annotation collections

Genus/species/collection names:

Haplotype 1:

Chamaecrista/fasciculata/genomes/ISC494698.gnm1.8Q19
Chamaecrista/fasciculata/annotations/ISC494698.gnm1.ann1.G7XW

Haplotype 2:

Chamaecrista/fasciculata/genomes/ISC494698.gnm1_hap2.G6BY
Chamaecrista/fasciculata/annotations/ISC494698.gnm1_hap2.ann1.WXZF

[x] Add collection(s) to the Data Store, including commits to datastore-metadata
[x] Validate the README(s)
[x] Update about_this_collection.yml
[x] Calculate AHRD functional annotations
[x] Calculate gene family assignments (.gfa)
[N/A] Add to pan-gene set
[x] Load relevant mine
[ ] Add BLAST targets
[x] Incorporate into GCV
[ ] Update the jekyll collections listing
[x] Update browser configs
[x] Run BUSCO
[x] Update DSCensor
[ ] Add LINKOUTS to datastore, refresh linkout service

StevenCannon-USDA commented 4 months ago

This one is back in play, following our discussion about handling haplotype-resolved assemblies.

adf-ncgr commented 4 months ago

@StevenCannon-USDA I should have the AHRDs on these two completed soon and will then move them from the annex to the main datastore. My preference would be to move them both there, since it seems like it would make sense to include them both in at least some (if not all) downstream systems. But I wanted to confirm with you, since I think you were originally planning to leave secondary haplotypes in the annex.

Also, one very minor note: it seems that the procedure you're using for the upstream processing is producing uncompressed GFF3 for the gene_models_main files, although they have the .gz suffix. Not really a problem, since we have to add the AHRD stuff in and redo compression/indexing, but it is a bit confusing when gunzip complains...

StevenCannon-USDA commented 4 months ago

"move them both there since it seems like it would make sense to include them both in at least some (if not all) downstream systems."

I agree now that moving them both to the main Data Store is best.

Thanks for the alert about the uncompressed GFF3s. I suspect that was due to some additional manual stuff I did when the automated compression failed (I think) due to an interrupted session.

adf-ncgr commented 4 months ago

OK, the data content related tasks (AHRD/BUSCO/gfa) should be complete and I've moved the folders into the main datastore; downstream steps will proceed as time permits but if there's any you consider higher priority than others let me know.

Regarding the compression, it definitely was an issue on both haplotypes, and I feel like I've seen it before but am not 100% sure about that. In any case, if I see it again I'll let you know.

StevenCannon-USDA commented 4 months ago

OK, thank you.

I'll also investigate the compression issue -- at least next time I run the process. The script responsible should be /usr/local/www/data/datastore-specifications/scripts/compress_and_index.sh and the code in question is:

for file in $filepath/*.f?a $filepath/*.gff3 $filepath/*tsv $filepath/*bed; do
  if test -f $file; then
    echo "Compressing $file"
    bgzip -l9 $file &
  fi
done
wait
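
A possible follow-up check, as a minimal sketch only: after the loop above finishes, test every file that carries a .gz suffix with gzip -t, which should also accept bgzip output since BGZF is gzip-compatible. The function name check_gz_suffix is hypothetical and is not part of compress_and_index.sh.

```shell
#!/usr/bin/env bash
# Sketch: report files that have a .gz suffix but are not valid gzip data.
# check_gz_suffix is a hypothetical helper, not part of compress_and_index.sh.
check_gz_suffix() {
  local dir=$1 file status=0
  for file in "$dir"/*.gz; do
    [ -f "$file" ] || continue            # skip when the glob matched nothing
    if ! gzip -t "$file" 2>/dev/null; then
      echo "Not gzip-compressed: $file"
      status=1
    fi
  done
  return "$status"                        # 0 if every .gz file passed, else 1
}
```

Calling check_gz_suffix "$filepath" after the wait would flag a plain-text GFF3 that was merely named with a .gz suffix.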

adf-ncgr commented 4 months ago

Well, that looks pretty straightforward, but now that I think about it some more, I don't think an interrupted session would explain the observed behavior, which is as if the original file were simply renamed with a .gz suffix. Is it possible that there's something else that just names it with a .gz extension (in which case the code above wouldn't even see it there)?

StevenCannon-USDA commented 4 months ago

"Is it possible that there's something else that just names it with a gz extension"

Helpful suggestion/clue. You are right. Here's the source of the problem. In the ds_souschef.pl configs chafa.ISC494698.gnm1_hap2.ann1.yml and chafa.ISC494698.gnm1.ann1.yml, the "to" suffix was given as gene_models_main.gff3.gz, but it should have been just gene_models_main.gff3, since the output is not gzipped by ds_souschef.pl.

  - 
    from: gene_strip.gff3.gz
    to: gene_models_main.gff3.gz
    description: "Gene models - main"

I'll plan to add checks for this in ds_souschef.pl once I've finished some other tasks.
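
Until such a check exists in ds_souschef.pl itself, a rough external check could look like the sketch below. It only scans a config for "to:" values ending in .gz and assumes unquoted values laid out as in the snippet above; check_to_suffix is a hypothetical name, and this is not how ds_souschef.pl reads its configs.

```shell
#!/usr/bin/env bash
# Sketch: flag "to:" entries ending in .gz in a ds_souschef.pl YAML config.
# check_to_suffix is a hypothetical helper; it assumes unquoted "to:" values,
# one per line, as in the config snippet quoted in this thread.
check_to_suffix() {
  local config=$1
  # grep -n prints each offending line prefixed with its line number.
  if grep -nE '^[[:space:]]*to:[[:space:]]*[^[:space:]]+\.gz[[:space:]]*$' "$config"; then
    return 1                              # at least one suspicious "to:" value
  fi
  return 0                                # no "to:" value ends in .gz
}
```

Running check_to_suffix chafa.ISC494698.gnm1.ann1.yml against the original config would print the gene_models_main.gff3.gz line and return a non-zero status.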