vrmarcelino / CCMetagen

Microbiome classification pipeline
GNU General Public License v3.0

Unable to construct database following your instruction #36

Closed: liuchen92 closed this issue 2 years ago

liuchen92 commented 2 years ago

Hi,

I followed your instructions, shown below, to process the recently downloaded nt and nucl_gb.accession2taxid data.

cut -f 2-3 nucl_gb.accession2taxid > accession_taxid_nucl.map
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < nt.fa > nt_sequential.fa

I then used rename.py to produce nt_w_taxid.fas, the input for KMA database generation.

But when I tried to build the KMA database from nt_w_taxid.fas with the command

kma index -i nt_w_taxid.fas -o kma

it threw the error "unsupported file format".

When I ran rename.py on nt directly instead of on nt_sequential.fa, the KMA database build worked fine.

I'm not sure whether converting the GenBank FASTA file to a sequential FASTA is necessary.
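
For reference, the awk one-liner quoted above prints a newline before every header line, including the first one, so its output starts with an empty line. Whether that is what KMA rejected is not established in this thread. The linearization step itself can be sketched in Python without the leading blank line; `linearize_fasta` is a hypothetical helper written for illustration and is not part of CCMetagen.

```python
import io


def linearize_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    `lines` is any iterable of text lines (an open file handle works).
    Unlike the awk one-liner, nothing is emitted before the first header.
    Text that appears before the first header is dropped.
    """
    seq_parts = []
    seen_header = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # Flush the previous record's sequence before the new header.
            if seen_header and seq_parts:
                yield "".join(seq_parts)
            seq_parts = []
            seen_header = True
            yield line
        elif line and seen_header:
            seq_parts.append(line)
    if seen_header and seq_parts:
        yield "".join(seq_parts)


# Small in-memory example; a real run would pass an open file handle instead.
sample = io.StringIO(">seq1 example\nACGT\nTTGA\n>seq2\nGGCC\n")
result = list(linearize_fasta(sample))
```

To write a file, each yielded line can be written followed by "\n". The generator holds only one record in memory at a time, which matters for a file the size of nt.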

liuchen92 commented 2 years ago

Additionally, the pre-indexed database downloaded with

wget -c 'https://cloudstor.aarnet.edu.au/plus/s/vfKH9S8c5FVGBjV/download?path=%2F&files=ncbi_nt_no_env_11jun2019.zip'

could not be unzipped. It failed with the error: invalid compressed data to inflate file

vrmarcelino commented 2 years ago

Hi!

Could you have a look at the header of the file you are trying to index? Either post an example or send part of the file to my email. Maybe the awk version did something to it that it wasn't supposed to?
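
One way to look at the start of the file without opening all of it is to print the first few lines with their hidden characters visible. This is only a sketch: `preview_fasta` is a hypothetical helper, and the file name in the usage note is the one mentioned earlier in this thread.

```python
import itertools


def preview_fasta(path, n_lines=4, width=80):
    """Return the first `n_lines` lines of a file as repr() strings.

    Each line is cut to `width` characters so long sequence lines stay readable.
    repr() makes blank lines, carriage returns and other stray characters visible.
    """
    with open(path, "r", errors="replace") as handle:
        return [repr(line[:width]) for line in itertools.islice(handle, n_lines)]
```

Calling `print("\n".join(preview_fasta("nt_w_taxid.fas")))` would show whether the first line is a header starting with ">" or something else, such as an empty line.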

I was not able to reproduce the unzipping error with the ncbi_nt_no_env_11jun2019.zip file. Your download might have failed, could you try again?
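
A quick integrity check can help tell a truncated or corrupted download apart from a problem with the unzip tool. The following is a minimal sketch using Python's standard `zipfile` module; `check_zip` is a hypothetical helper, and reading an archive of this size in full will take a while.

```python
import zipfile
import zlib


def check_zip(path):
    """Return (ok, detail) describing whether a zip archive reads cleanly."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the name of the first
            # one that fails its CRC check, or None if all members pass.
            bad_member = archive.testzip()
    except (zipfile.BadZipFile, zlib.error, OSError) as exc:
        return False, f"unreadable archive: {exc}"
    if bad_member is not None:
        return False, f"corrupt member: {bad_member}"
    return True, "all members passed the CRC check"
```

For example, `check_zip("ncbi_nt_no_env_11jun2019.zip")` returns a pair whose first element is False if the archive is incomplete or damaged.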

Thanks!

liuchen92 commented 2 years ago

Thanks.

I solved the zip problem. It might have been due to changing the file name (I changed the name 'download?path=%2F&files=ncbi_nt_no_env_11jun2019.zip' to ncbi_nt_no_env_11jun2019.zip).

The awk step seems unnecessary, as rename.py worked perfectly on the nt database directly.