Open dutchscientist opened 7 years ago
That's not available at download time, and pretty tricky to hack in afterwards, because looking at the FASTA header of course only works for the FASTA files, and the code doesn't care what file format it's downloading.
Would a symlink with the --human-readable option work for you?
I was thinking along the lines of what is proposed in #15
@dutchscientist See my comment https://github.com/kblin/ncbi-genome-download/issues/15#issuecomment-322814530
This is a bit hackish but works without modifying the ngd package. If you invoke the --human-readable option, you'll get ./human_readable/{genbank,refseq}/<Domain>/<Genus>/<Species epithet>/<Strain>/<Accession>.gbff.gz so you can make symlinks off of those path names. Here's an example of how I made relative symlinks to GenBank files of refseq bacteria. Modify the awk print part to name as you like.
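# Download with the human-readable directory tree enabled
ngd --genus "Aestuariispira,Azospirillum" --human-readable -o from_NCBI --verbose bacteria
cd from_NCBI
mkdir -p human_readable_long
for f in human_readable/refseq/bacteria/*/*/*/*.gbff.gz; do
    # Assembly accession, e.g. GCF_003385955.1 (first two "_"-separated fields)
    accn=$(basename "$f" | cut -d _ -f 1,2)
    # Build Genus_species_strain_(accession).gbff.gz from the path components
    long_name=$(echo "$f" | awk -v var="${accn}" -F'/' '{print $(NF-3)"_"$(NF-2)"_"$(NF-1)"_("var").gbff.gz"}')
    # Relative symlink, so the tree can be moved as a whole
    ln -sv ../"${f}" human_readable_long/"${long_name}"
done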
The structure here is simple (all symlinks go in one new subdir), so it may be inappropriate for large projects:
'human_readable_long/Aestuariispira_insulae_CECT_8488_(GCF_003385955.1).gbff.gz' -> ../human_readable/refseq/bacteria/Aestuariispira/insulae/CECT_8488/GCF_003385955.1_ASM338595v1_genomic.gbff.gz
'human_readable_long/Azospirillum_brasilense_FP2_(GCF_000404045.1).gbff.gz' -> ../human_readable/refseq/bacteria/Azospirillum/brasilense/FP2/GCF_000404045.1_ASM40404v1_genomic.gbff.gz
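The same path components could also be written to a plain list instead of (or alongside) the symlinks, which works for any downloaded file format because it never opens the files. This is only a sketch: `human_readable_index` is a made-up helper name, not part of ngd, and it assumes the `<Genus>/<Species epithet>/<Strain>/<Accession>...` layout shown above.

```shell
# Sketch only: human_readable_index is a hypothetical helper, not an ngd feature.
# Prints: accession <TAB> genus <TAB> species epithet <TAB> strain
human_readable_index() {
    dir="$1"
    # Entries may be regular files or symlinks, so accept both.
    # Assumes no newlines in path names.
    find "$dir" \( -type f -o -type l \) -name 'GC*' | sort |
    awk -F/ 'NF >= 4 {
        # The file name starts with the accession, e.g. GCF_003385955.1_...
        split($NF, part, "_")
        printf "%s_%s\t%s\t%s\t%s\n", part[1], part[2], $(NF-3), $(NF-2), $(NF-1)
    }'
}
```

For example, `human_readable_index human_readable/refseq/bacteria > genome_index.tsv` run inside `from_NCBI` would give one line per downloaded file.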
I don’t know if this is useful, since it would require running over all the downloaded sequences again, but I’ve been tinkering with this:
A tool for conversion from assorted NCBI Accessions to NCBI Taxonomy IDs https://github.com/jrjhealey/PYlogeny
The idea is to throw accessions at Entrez and get back a full taxonomic breakdown of the accession number. I haven’t tried it with genomes, but it’s just using the ETE3 mechanism in contrib/gimme_taxa.py. At the moment it’s super experimental, and I’ll need to modify it slightly to deal with non-RefSeq stuff, but is that what you had in mind, OP?
Currently all genomes are downloaded with cryptic filenames, such as: "GCF_000469325.1.fna"
The FASTA header of that file is: "NZ_KI271582.1 Lactobacillus shenzhenensis LY-73 genomic scaffold LY73.Scaffold1, whole genome shotgun sequence"
Is it possible that ncbi-genome-download also makes a list of filename + descriptor?
Example: GCF_000469325.1 NZ_KI271582.1 Lactobacillus shenzhenensis LY-73 ...
GCF_000967245.1 NZ_KQ033877.1 Lactobacillus mellis strain Hon2 ...
etc
I am sure I can create something like that myself, but for Linux novices (which I am, somewhat) this would really enhance the tool :+1:
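Until something like this exists in the tool, a list of this kind can be built after the download with a few lines of shell. The following is a minimal sketch, not an ngd feature: `list_fasta_headers` is a hypothetical helper name, and it assumes the FASTA files end in `.fna` or `.fna.gz` and that their names start with the assembly accession.

```shell
# Sketch only: list_fasta_headers is a hypothetical helper, not part of ngd.
# Prints: accession <TAB> first FASTA header (without the leading ">")
list_fasta_headers() {
    dir="$1"
    # Search nested download directories; assumes no newlines in path names.
    find "$dir" -type f \( -name '*.fna' -o -name '*.fna.gz' \) | sort |
    while IFS= read -r f; do
        # Keep only the leading accession, e.g. GCF_000469325.1
        accn=$(basename "$f" | sed -E 's/^(GC[AF]_[0-9]+\.[0-9]+).*/\1/')
        # gzip -cdf decompresses .gz files and passes plain files through;
        # sed prints the first header line and then stops reading.
        header=$(gzip -cdf "$f" | sed -n '/^>/{s/^>//;p;q;}')
        printf '%s\t%s\n' "$accn" "$header"
    done
}
```

Something like `list_fasta_headers . > genome_list.tsv` in the download directory would then produce the list. Only the first header of each file is reported, so for multi-scaffold assemblies the descriptor is that of the first sequence.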