kblin / ncbi-genome-download

Scripts to download genomes from the NCBI FTP servers
Apache License 2.0

Create output of descriptors of downloaded genomes #38

Open dutchscientist opened 7 years ago

dutchscientist commented 7 years ago

Currently all genomes are downloaded as cryptic filenames, such as: "GCF_000469325.1.fna"

The FASTA header of that file is: "NZ_KI271582.1 Lactobacillus shenzhenensis LY-73 genomic scaffold LY73.Scaffold1, whole genome shotgun sequence"

Is it possible that ncbi-genome-download also makes a list of filename + descriptor?

Example:
GCF_000469325.1 NZ_KI271582.1 Lactobacillus shenzhenensis LY-73 ...
GCF_000967245.1 NZ_KQ033877.1 Lactobacillus mellis strain Hon2 ...

etc

I am sure I can create something like that myself, but for Linux novices (as I am, a bit) this would really enhance the tool :+1:
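For illustration, a list like the one requested could be built after the download with a few lines of shell. This is only a minimal sketch, not part of ncbi-genome-download: the function name `list_fasta_descriptors` is hypothetical, and it assumes the FASTA files end in `.fna` or `.fna.gz` and that each file name starts with the assembly accession.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper, not part of ncbi-genome-download):
# print "<accession><TAB><first FASTA header>" for each FASTA file under a directory.
# Assumes file names start with the accession, e.g. GCF_000469325.1.fna
# or GCF_000469325.1_<assembly name>_genomic.fna.gz.
list_fasta_descriptors() {
  local dir=$1 f base accn header
  find "$dir" -type f \( -name '*.fna' -o -name '*.fna.gz' \) | sort |
  while IFS= read -r f; do
    base=$(basename "$f")
    # Accession = first two underscore-separated fields, minus any file extension.
    accn=$(printf '%s\n' "$base" | cut -d _ -f 1,2 | sed -E 's/\.fna(\.gz)?$//')
    # Only the first header line is read; gzip -cd handles compressed files.
    case $f in
      *.gz) header=$(gzip -cd -- "$f" | grep -m 1 '^>') ;;
      *)    header=$(grep -m 1 '^>' -- "$f") ;;
    esac
    # Strip the leading ">" from the header before printing.
    printf '%s\t%s\n' "$accn" "${header#>}"
  done
}
```

Running something like `list_fasta_descriptors path/to/downloads > descriptors.tsv` would then give one tab-separated line per file. Only the first header of each file is reported, so multi-scaffold assemblies show just their first record.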

kblin commented 7 years ago

That's not available at download time, and pretty tricky to hack in afterwards, because looking at the FASTA header of course only works for the FASTA files, and the code doesn't care what file format it's downloading. Would a symlink with the --human-readable option work for you?

kblin commented 7 years ago

I was thinking along the lines of what is proposed in #15

andrewsanchez commented 7 years ago

@dutchscientist See my comment https://github.com/kblin/ncbi-genome-download/issues/15#issuecomment-322814530

chrisgulvik commented 5 years ago

This is a bit hackish but works without modifying the ngd package. If you invoke the --human-readable option, you'll get ./human_readable/{genbank,refseq}/<Domain>/<Genus>/<Species epithet>/<Strain>/<Accession>.gbff.gz so you can make symlinks off of those path names. Here's an example of how I made relative symlinks to GenBank files of refseq bacteria. Modify the awk print part to name as you like.

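ngd --genus "Aestuariispira,Azospirillum" --human-readable -o from_NCBI --verbose bacteria
cd from_NCBI
mkdir human_readable_long
for f in human_readable/refseq/bacteria/*/*/*/*.gbff.gz; do
  # Accession = first two underscore-separated fields of the file name
  accn=$(basename "$f" | cut -d _ -f 1,2)
  # Build <Genus>_<Species>_<Strain>_(<Accession>).gbff.gz from the path components
  long_name=$(echo "$f" | awk -v var="${accn}" -F'/' '{print $(NF-3)"_"$(NF-2)"_"$(NF-1)"_("var").gbff.gz"}')
  # Relative symlink, resolved from inside human_readable_long/
  ln -sv ../"${f}" human_readable_long/"${long_name}"
done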

The structure here is simple (all symlinks go into one new subdir), so it may be inappropriate for large projects. Output from ln -sv:


'human_readable_long/Aestuariispira_insulae_CECT_8488_(GCF_003385955.1).gbff.gz' -> ../human_readable/refseq/bacteria/Aestuariispira/insulae/CECT_8488/GCF_003385955.1_ASM338595v1_genomic.gbff.gz
'human_readable_long/Azospirillum_brasilense_FP2_(GCF_000404045.1).gbff.gz' -> ../human_readable/refseq/bacteria/Azospirillum/brasilense/FP2/GCF_000404045.1_ASM40404v1_genomic.gbff.gz
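The same path components can also be turned into a plain accession-plus-organism list without reading any file contents, so it works for every downloaded format. The following is a minimal sketch under the assumption that paths follow the `<Genus>/<Species epithet>/<Strain>/<Accession>...` layout described above; the function name `describe_human_readable_path` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper): print accession, genus, species and strain as
# tab-separated columns, derived purely from a --human-readable path name.
describe_human_readable_path() {
  local path=$1 file rest strain species genus accn
  file=${path##*/}                      # file name without directories
  rest=${path%/*}; strain=${rest##*/}   # last directory    = strain
  rest=${rest%/*}; species=${rest##*/}  # next one up       = species epithet
  rest=${rest%/*}; genus=${rest##*/}    # next one up again = genus
  # Accession = first two underscore-separated fields of the file name.
  accn=$(printf '%s\n' "$file" | cut -d _ -f 1,2)
  printf '%s\t%s\t%s\t%s\n' "$accn" "$genus" "$species" "$strain"
}

# Usage, run from the output directory (uncomment to apply to real downloads):
# for f in human_readable/refseq/bacteria/*/*/*/*.gbff.gz; do
#   describe_human_readable_path "$f"
# done > descriptors.tsv
```

Parameter expansion is used instead of awk here only to keep the helper free of extra processes per component; the awk approach above yields the same fields.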

jrjhealey commented 5 years ago

I don’t know if this is useful, since it would require running over all the downloaded sequences again, but I’ve been tinkering with this:

A tool for conversion from assorted NCBI Accessions to NCBI Taxonomy IDs https://github.com/jrjhealey/PYlogeny

The idea is to throw accessions at Entrez and get back a full taxonomic breakdown for each accession number. I haven’t tried it with genomes, but it just uses the ETE3 mechanism from contrib/gimme_taxa.py. At the moment it’s super experimental, and I’ll need to modify it slightly to deal with non-RefSeq stuff, but is that what you had in mind, OP?