sara-javadzadeh / FastViFi

Detect viral infection and integration sites on NGS input. Manuscript is in preparation.
GNU General Public License v3.0

Failing to build custom database for HBV #9

Open wskang1202 opened 1 year ago

wskang1202 commented 1 year ago

Hi Sara,

I've been trying to build custom databases by following the FastViFi README. Building the databases for HCV and EBV was successful; however, building the HBV databases for k=18 and k=22 was unsuccessful. The following message was shown in the log file:

scan_fasta_file.pl: unable to determine taxonomy ID for sequence hbv_ref7
No preliminary seqid/taxid mapping files found, aborting.

Is there a way to solve this problem?

Best, Wonseok
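
For context, Kraken2's scan_fasta_file.pl resolves a taxonomy ID for each library sequence either from the NCBI accession2taxid maps or from an explicit kraken:taxid tag in the sequence header. Below is a minimal sketch of the documented header-tagging workaround, assuming the HBV references live in a hypothetical hbv_refs.fasta; 10407 is the NCBI taxonomy ID for hepatitis B virus:

# Tag each reference header with an explicit Kraken2 taxid so
# scan_fasta_file.pl can map it without the accession2taxid files.
sed 's/^>\(hbv_ref[0-9]*\)/>\1|kraken:taxid|10407/' hbv_refs.fasta > hbv_refs.tagged.fasta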

sara-javadzadeh commented 1 year ago

Hi Wonseok,

It looks like the file prelim_map.txt is missing. Does the file exist in the kraken2/<your HBV db name>/taxonomy directory? If not, one reason could be that downloading the library failed. Could you please run download_custom_kraken_library.sh for HBV again and check if the prelim_map.txt file is downloaded in your HBV database directory?
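
A quick way to run that check (a sketch; the database directory name follows the k_18 naming used later in this thread):

# Confirm the file exists and is non-empty; -s tests for size > 0.
DB=kraken2/Kraken2StandardDB_k_18_hbv
ls -lh "$DB"/taxonomy/prelim_map.txt
[ -s "$DB"/taxonomy/prelim_map.txt ] && echo "non-empty" || echo "missing or empty"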

Please let me know if this doesn't work for you.

Best, Sara

wskang1202 commented 1 year ago

Hi, Sara,

I ran download_custom_kraken_library.sh for HBV again, and I can see that there is a prelim_map.txt file in kraken2/Kraken2StandardDB_k_18_hbv/taxonomy, but the file itself is empty.

Best, Wonseok

sara-javadzadeh commented 1 year ago

Hi Wonseok,

Do you get an error when running download_custom_kraken_library.sh for the HBV dataset? Could you please check if the prelim_map.txt is present and non-empty in the HCV and EBV databases that you created successfully before?

Best, Sara

wskang1202 commented 1 year ago

Hi, Sara.

The prelim_map.txt file is present and non-empty in the successfully built databases (HCV and EBV, as well as the k_25_hbv_hg database). However, the file is empty for the unsuccessful k_18_hbv and k_22_hbv databases. I've attached the log.txt file in case you want to check it out.

Thank you, Wonseok

sara-javadzadeh commented 1 year ago

Hi Wonseok,

Did you try running the build_custom_kraken_index.sh script on the k_18_hbv database after running download_custom_kraken_library.sh? If so, was there any error?

mrzResearchArena commented 1 year ago

Hi Javadzadeh,

I have downloaded your suggested dataset for sample-level FastViFi for the HPV virus: https://drive.google.com/file/d/1QYn5lDWjvhtIWCrwmzDc_1fy8ANrXWz1/view?usp=sharing. However, when I attempted to extract it (tar -xzvf kraken_datasets.tar.gz), it showed errors. Could you please suggest how I can fix this?

sara-javadzadeh commented 1 year ago

Hi Muhammod,

Thanks for reaching out. Could you please share the error messages when running tar -xzvf kraken_datasets.tar.gz?

mrzResearchArena commented 1 year ago

Hi Javadzadeh,

Thank you so much for your response. I am getting the errors below. The downloaded file size is 15796400321 bytes.

gzip: stdin: invalid compressed data--crc error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
ls -l kraken_datasets.tar.gz

sara-javadzadeh commented 1 year ago

Hi again,

Thanks! Although the output of the ls -l command is truncated in your reply, I can see the file size in your text above. The file size seems correct.

Did you try running gunzip kraken_datasets.tar.gz and then running tar -xvf kraken_datasets.tar? If it's failing, could you please share the error?

By the way, the uncompressed data should be about 60 GB. Have you taken that into account?
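
One quick way to check, run in the directory you plan to extract into:

# Show free space on the filesystem holding the current directory.
df -h .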

Thanks, Sara

mrzResearchArena commented 1 year ago

Hi Javadzadeh,

Yes, I tried that. However, it didn't work.

gzip: kraken_datasets.tar.gz: invalid compressed data--crc error

mrzResearchArena commented 1 year ago

Hi Javadzadeh, could you please provide a different download link?

sara-javadzadeh commented 1 year ago

I can provide another link; it'll take a couple of hours to upload the database. In the meantime, could you please check the following?

  1. Could you please share the output of the following command: file kraken_datasets.tar.gz
  2. Check whether tar -tf kraken_datasets.tar.gz can list the files without an error. If there is an error, could you please share it? (Both commands are spelled out below.)
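
Both checks are read-only and safe to run on the downloaded file:

# Identify the file type from its magic bytes; an intact download
# should be reported as gzip compressed data.
file kraken_datasets.tar.gz

# List the archive members without extracting anything.
tar -tf kraken_datasets.tar.gz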

Sara

mrzResearchArena commented 1 year ago

Yes, it shows errors. You can view the error by clicking the link.

tar -tf kraken_datasets.tar.gz > errors-text.txt

Output:

kraken_datasets/
kraken_datasets/Kraken2StandardDB_k_22_hpv/
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/readme.txt
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/merged.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/taxdump.tar.gz
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/names.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/taxdump.untarflag
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/accmap.dlflag
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/delnodes.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/citations.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/nodes.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/nucl_gb.accession2taxid
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/gc.prt
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/nucl_wgs.accession2taxid
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/division.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/gencode.dmp
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/taxdump.dlflag
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxonomy/prelim_map.txt
kraken_datasets/Kraken2StandardDB_k_22_hpv/seqid2taxid.map
kraken_datasets/Kraken2StandardDB_k_22_hpv/hash.k2d
kraken_datasets/Kraken2StandardDB_k_22_hpv/taxo.k2d
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/prelim_map.txt
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/9TbkQmfdkG.fna.masked
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/9TbkQmfdkG.fna
kraken_datasets/Kraken2StandardDB_k_22_hpv/library/added/prelim_map_3IwJCtpJpX.txt
kraken_datasets/Kraken2StandardDB_k_22_hpv/opts.k2d
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/taxo.k2d
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/prelim_map.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/assembly_summary.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/library.fna.masked
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/library.fna
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/human/manifest.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/prelim_map_SeYmVYHiCd.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/prelim_map.txt
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/rKtNPyn11J.fna
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/library/added/rKtNPyn11J.fna.masked
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/opts.k2d
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/taxonomy/
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/taxonomy/gencode.dmp
kraken_datasets/Kraken2StandardDB_k_25_hpv_hg/taxonomy/nucl_wgs.accession2taxid
tar: Skipping to next header
tar: Archive contains ‘9.1\t5748’ where numeric mode_t value expected
tar: Archive contains ‘.1\t57486\t478’ where numeric time_t value expected
7486\t47861343\nAG288467\tAG288467.1\t57486\t47861344\nAG288468\tAG288468.1\t57486\t47861345\nAG288469\tAG28846
tar: Skipping to next header
tar: Archive contains ‘0672.1\t4113\t’ where numeric off_t value expected
tar: Archive contains ‘119.1\t262687’ where numeric off_t value expected
tar: Archive contains ‘1.1\t1639’ where numeric mode_t value expected
tar: Archive contains ‘.1\t1639\t1129’ where numeric time_t value expected
tar: Archive contains ‘\t1129612’ where numeric uid_t value expected
639\t112961221\nDQ844259\tDQ844259.1\t1639\t112961224\nDQ844260\tDQ844260.1\t1639\t112961227\nDQ844261\tDQ84426
tar: Skipping to next header
tar: Archive contains ‘1.1\t6253’ where numeric mode_t value expected
tar: Archive contains ‘.1\t6253\t1132’ where numeric time_t value expected
253\t113251528\nED394649\tED394649.1\t6253\t113251529\nED394650\tED394650.1\t6253\t113251530\nED394651\tED39465
tar: Skipping to next header
tar: Archive contains ‘1609\nEZ97768’ where numeric off_t value expected
tar: Archive contains ‘\tHE793950.1\t’ where numeric off_t value expected
322560303\nJG336704\tJG336704.1\t30301\t322560304\nJG336705\tJG336705.1\t30301\t322560305\nJG336706\tJG336706.
tar: Skipping to next header
tar: Archive contains ‘1759748\t’ where numeric mode_t value expected
tar: Archive contains ‘95526170’ where numeric uid_t value expected
697\nKR112558\tKR112558.1\t1387109\t955261699\nKR112559\tKR112559.1\t1690892\t955261701\nKR112560\tKR112560.1\t
tar: Skipping to next header
tar: Archive contains ‘\tLA487646.1\t’ where numeric off_t value expected
tar: Archive contains ‘29\tMC492929.’ where numeric off_t value expected
tar: Archive contains ‘31460994\nMM1’ where numeric time_t value expected
tar: Archive contains ‘993\nMM16’ where numeric uid_t value expected
0\t1531460990\nMM160627\tMM160627.1\t0\t1531460991\nMM160628\tMM160628.1\t0\t1531460992\nMM160629\tMM160629.1\t0
tar: Skipping to next header
tar: Archive contains ‘_019029293.1’ where numeric off_t value expected
tar: Archive contains ‘\t50390\t15815’ where numeric off_t value expected
tar: Archive contains ‘OC673270’ where numeric mode_t value expected
tar: Archive contains ‘\tOC673271.1\t’ where numeric time_t value expected
tar: Archive contains ‘.1\t61476’ where numeric uid_t value expected
tar: Archive contains ‘\t1946114’ where numeric gid_t value expected
61476\t1946114713\nOC673268\tOC673268.1\t61476\t1946114714\nOC673269\tOC673269.1\t61476\t1946114715\nOC673270\t
tar: Skipping to next header
tar: Archive contains ‘\tOD59341’ where numeric mode_t value expected
tar: Archive contains ‘0.1\t6147’ where numeric uid_t value expected
\t61472\t1948381426\nOD593408\tOD593408.1\t61472\t1948381428\nOD593409\tOD593409.1\t61472\t1948381430\nOD593410
tar: Skipping to next header
tar: Archive contains ‘OD855125’ where numeric mode_t value expected
tar: Archive contains ‘\tOD855126.1\t’ where numeric time_t value expected
tar: Archive contains ‘.1\t61472’ where numeric uid_t value expected
tar: Archive contains ‘\t1947471’ where numeric gid_t value expected
61472\t1947471274\nOD855123\tOD855123.1\t61472\t1947471275\nOD855124\tOD855124.1\t61472\t1947471276\nOD855125\t
tar: Skipping to next header
tar: Archive contains ‘\tOE36610’ where numeric mode_t value expected
tar: Archive contains ‘6.1\t6147’ where numeric uid_t value expected
\t61474\t1962876452\nOE366104\tOE366104.1\t61474\t1962876453\nOE366105\tOE366105.1\t61474\t1962876454\nOE366106
tar: Skipping to next header
tar: Archive contains ‘OE507501’ where numeric mode_t value expected
tar: Archive contains ‘\tOE507502.1\t’ where numeric time_t value expected
tar: Archive contains ‘.1\t61474’ where numeric uid_t value expected
tar: Archive contains ‘\t1964446’ where numeric gid_t value expected
61474\t1964446754\nOE507499\tOE507499.1\t61474\t1964446757\nOE507500\tOE507500.1\t61474\t1964446760\nOE507501\t
tar: Skipping to next header
tar: Archive contains ‘081\nOE597102’ where numeric off_t value expected
tar: Archive contains ‘\tOE60725’ where numeric mode_t value expected
tar: Archive contains ‘9\tOE607259.1’ where numeric time_t value expected
tar: Archive contains ‘8.1\t6147’ where numeric uid_t value expected
\t61474\t1965131656\nOE607256\tOE607256.1\t61474\t1965131659\nOE607257\tOE607257.1\t61474\t1965131662\nOE607258
tar: Skipping to next header
tar: Archive contains ‘03024007.1\t6’ where numeric time_t value expected
tar: Archive contains ‘\t3026648’ where numeric uid_t value expected
003024004.1\t663202\t302664848\nXM_003024005\tXM_003024005.1\t663202\t302664850\nXM_003024006\tXM_003024006.
tar: Skipping to next header
tar: Archive contains ‘008481066.2\t’ where numeric off_t value expected

gzip: stdin: invalid compressed data--crc error

gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now

sara-javadzadeh commented 1 year ago

Thanks for checking. I'm uploading the databases again; it'll take another couple of hours to fully upload. I'll share the link here as soon as it's done. In the meantime, it might be worth setting up a new Conda environment, installing tar, and trying to extract the database files in this new clean environment. Let me know if you still get the errors.
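
A sketch of that clean-environment route, assuming the tar package is available on conda-forge (the environment name is arbitrary):

conda create -n extract-env -c conda-forge tar
conda activate extract-env
tar -xzvf kraken_datasets.tar.gz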

Sara

sara-javadzadeh commented 1 year ago

Hi again,

Here's a second link for the same Kraken databases: https://drive.google.com/file/d/1DrKgDE7fl5Tff2bV8K9XBxLYsbTeOcgh/view?usp=sharing
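
Browser downloads of large Google Drive files sometimes truncate or corrupt silently, so a command-line fetch can be more reliable. A sketch assuming the gdown package (pip install gdown), with the file ID taken from the link above:

gdown 1DrKgDE7fl5Tff2bV8K9XBxLYsbTeOcgh -O kraken_datasets.tar.gz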

I suspect this might be a tar library incompatibility rather than a file problem. I was able to list the contents of kraken_datasets.tar.gz using the first link (provided in the README file). Here's my tar version on macOS 12.1:

tar --version
bsdtar 3.5.1 - libarchive 3.5.1 zlib/1.2.11 liblzma/5.0.5 bz2lib/1.0.8

That's why I would recommend updating your tar package or creating a new Conda environment and trying again as described above. Let me know how it goes.

Sara

mrzResearchArena commented 1 year ago

Thank you, Ms. Javadzadeh. It helped me a lot.

I used a Python script instead of tar, and this time it did not show any errors. After extracting, I got a total size of 61.4 GB. Is that the correct size?

import tarfile

sourcePATH = '/mnt/sdb1/kraken2/kraken_datasets.tar.gz'
destinationPATH = '/mnt/sdb1/kraken2/'

# Open the gzipped tarball and extract everything into the destination;
# the with-block closes the archive automatically.
with tarfile.open(sourcePATH) as tar:
    tar.extractall(destinationPATH)
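
To compare the extracted size against the roughly 60 GB estimate mentioned earlier (the path follows the script above; the archive's top-level directory is kraken_datasets per the listing):

# Summarize the total on-disk size of the extracted directory.
du -sh /mnt/sdb1/kraken2/kraken_datasets
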
sara-javadzadeh commented 1 year ago

Great! Thanks for letting me know. The size of the extracted files sounds reasonable.

Sara

cubense commented 7 months ago

Hi Wonseok,

Did you try running the build_custom_kraken_index.sh script on k_18_hbv database, after running download_custom_kraken_library.sh? If so, was there any error?

Hi Sara, I am running into the same error as Wonseok. There were no errors when running build_custom_kraken_index.sh and download_custom_kraken_library.sh, but prelim_map.txt is empty for the k_18 and k_22 databases, while prelim_map.txt for k_25_hg is fine. When I run the Docker image, it shows the error: does not contain necessary file taxo.k2d.
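
For reference, a finished Kraken2 database directory contains hash.k2d, opts.k2d, and taxo.k2d, all produced by the build step, so a missing taxo.k2d means the build never completed; an empty prelim_map.txt suggests no sequences received taxonomy IDs, leaving the build nothing to index. A quick check (sketch; database path assumed from earlier in the thread):

# A complete database should list all three .k2d files.
ls kraken2/Kraken2StandardDB_k_18_hbv/*.k2d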