sara-javadzadeh / FastViFi

Detect viral infection and integration sites on NGS input. Manuscript is in preparation.
GNU General Public License v3.0

[E::bwa_idx_load_from_disk] fail to locate the index files #14

Open v-mukhina opened 8 months ago

v-mukhina commented 8 months ago

Hi Sara, could you please help me? My issue is probably related to this one: https://github.com/sara-javadzadeh/FastViFi/issues/10 I'm using the following Singularity command to run FastViFi on the test files:

python run_kraken_vifi_container.py \
  --singularity \
  --input-file test/test_reads_1.fq \
  --input-file-2 test/test_reads_2.fq \
  --output-dir ../test_out \
  --virus hpv \
  --kraken-db-path ../kraken_datasets \
  --vifi-viral-ref-dir ../viral_data/ \
  --human-chr-list test/human_chr_list.txt \
  --vifi-human-ref-dir ../data_repo \
  --level sample-level \
  --skip-bwa-filter \
  --keep-intermediate-files

Right after Kraken finishes, I hit a bwa-related error:

...
[E::bwa_idx_load_from_disk] fail to locate the index files
Traceback (most recent call last):
  File "/home/ViFi/scripts/get_trans_new.py", line 104, in <module>
    bamFile = pysam.Samfile(opts.dataName[0], 'rb')
  File "pysam/libcalignmentfile.pyx", line 747, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 996, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False
[E::hts_open_format] Failed to open file "/home/output/output_hpv.unknown.bam" : No such file or directory
...

All subsequent bam files are also empty.

I downloaded data_repo and viral_data from Google Drive using the link in the README, and neither contains anything that looks like bwa index files. The error does not go away after I index the hg38 and hg19 FASTA files in data_repo. How do I fix this error?

Btw, it appears that the hg19 value is hardcoded here: https://github.com/sara-javadzadeh/ViFi/blob/b1a649685af0620a1d16a8940bb3e21db0fa17b5/scripts/cluster.sh#L10C1-L17C1 I am not sure whether this script is used anywhere.

Best, Vera

sara-javadzadeh commented 8 months ago

Hi Vera,

Thanks for reaching out, and sorry for the delay in responding. This error means the input BAM file is not present when ViFi attempts to process it. That can happen if one of the filtering steps that runs before the ViFi step fails, which leaves the input file to ViFi empty and triggers the error. Could you please share all the non-empty intermediate FASTA/FASTQ and BAM files created by the command? I'm trying to figure out which step is causing the problem.

Best, Sara
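One way to collect the files requested above is to walk the output directory and separate non-empty intermediate files from empty ones. The following is only a minimal sketch: the function name, the suffix list, and treating "empty" as a zero-byte file are all assumptions, not part of FastViFi.

```python
from pathlib import Path

# Suffixes of intermediate files to inspect (an assumption; adjust as needed).
SUFFIXES = {".bam", ".fq", ".fastq", ".fa", ".fasta", ".fas"}


def split_by_size(output_dir):
    """Return (non_empty, empty) lists of matching files under output_dir.

    "Empty" here means zero bytes on disk. A BAM that holds only a header
    is not zero bytes, so it would be reported as non-empty.
    """
    non_empty, empty = [], []
    for path in sorted(Path(output_dir).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUFFIXES:
            if path.stat().st_size > 0:
                non_empty.append(path)
            else:
                empty.append(path)
    return non_empty, empty
```

Calling `split_by_size("../test_out")` on the output directory from the command above would show at which stage the files start coming out empty.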

v-mukhina commented 7 months ago

Unfortunately, I already deleted all the related files and switched to another tool. However, it looks like the issue is not the BAM file itself but the reference. The bwa_idx_load_from_disk error usually appears when the reference FASTA file has not been indexed with bwa index. I believe ViFi crashes on the very first bwa command (bwa mem?) that needs those index files for the reference, and then all the following BAM files are empty or simply absent.
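A quick way to test this hypothesis is to check whether the index files sit next to the reference FASTA. The sketch below assumes the classic `bwa index` output, which writes five files named after the FASTA with the suffixes `.amb`, `.ann`, `.bwt`, `.pac`, and `.sa`; the function name is hypothetical, and the path in the usage note is only an example.

```python
from pathlib import Path

# Files that classic `bwa index` writes next to the FASTA it indexes.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")


def missing_bwa_index_files(fasta_path):
    """Return the expected bwa index files that do not exist for fasta_path."""
    fasta = str(Path(fasta_path))
    expected = [Path(fasta + suffix) for suffix in BWA_INDEX_SUFFIXES]
    return [path for path in expected if not path.is_file()]
```

For example, `missing_bwa_index_files("../viral_data/hpv/hpv.unaligned.fas")` returns an empty list when all five index files are present, and the missing paths otherwise.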

v-mukhina commented 7 months ago

Oh wait, I found them! All the BAM files are empty, but the FASTQ files are not. (Attachment: Archive.zip)

sara-javadzadeh commented 7 months ago

Hi Vera,

Sorry to hear about your troubles with FastViFi, and thanks for sharing the output files. As you mentioned, it looks like the Kraken step works well and the ViFi step fails. I could not reproduce the problem: your exact command runs correctly on my end. Your point about index files for the reference FASTA files sounds valid, but the problem apparently persists after you indexed the GRCh38 reference file in the data_repo directory. ViFi uses the viral_data/hpv/grch38_hpv.fas file to map the input FASTQ files to the human and viral reference genomes. Is this file present in the downloaded viral_data directory? If so, could you please index this FASTA file as well and try again?

Also, could you please share the version of Singularity you are using? My tests run successfully with Singularity version 3.8.6.

Your point about the hg19 reference being hard-coded is a good catch, but that code is not called for viral read detection.

Best, Sara

v-mukhina commented 7 months ago

This is what I have in the viral_data/hpv folder (viral_data.tar.gz was downloaded from the ViFi repository, as suggested in the README):

[image: contents of the viral_data/hpv folder]

v-mukhina commented 7 months ago

I've indexed hpv.unaligned.fas on my own to ensure this was not the reason for my issue.

sara-javadzadeh commented 7 months ago

Hi Vera,

I believe I understand the source of the problem. There should be a grch38_hpv.fas file and its corresponding index files in the viral_data/hpv folder. This file is created automatically by these two lines in setup_linux_mac.sh in the ViFi repo. I suggest running the whole setup_linux_mac.sh script. Since you have already downloaded data_repo and viral_data, please copy or move them to the directory containing setup_linux_mac.sh (the ViFi directory) before running it, so that it does not download the two directories again. The script creates human-viral reference files for three viruses: HPV, HBV and HCV. If you are only interested in HPV, feel free to edit this line in the script so that it runs only for hpv.

After running the script, you should have grch38_hpv.fas and its corresponding index files in the viral_data/hpv directory. If you cannot see these files after running the setup script, please let me know.