split_bam.py fails to sort output exogenous bam file (broken header)

lucananni93 commented 2 years ago

Dear developers, thanks for creating this very useful package. I am encountering a problem when using your split_bam.py script.

The script runs well and outputs the final report.txt as well as the unsorted and sorted BAMs for the "both" and "human" reads. For the exogenous reads it only outputs the unsorted BAM file; the sorting step fails and samtools reports the following error:

[main_samview] fail to read the header from "<SAMPLE_NAME>_exogenous.sorted.bam".

It seems that when splitting the merged bam file something goes wrong with the compilation of the exogenous reads.

Is it possible to fix the issue?

liguowang commented 2 years ago

Hi, it took me a while to find the problem, and it is still not clear how it happened. In the "*_exogenous.bam" file, if you check the header, there is a line "[main_samview] truncated file" (or something similar, which was incorrectly inserted into the BAM file by samtools). If you manually remove this line, you will be able to sort and index the BAM file.

lucananni93 commented 2 years ago

Sorry, I don't understand how to "check the header".

If I try to run samtools view -H on the "*_exogenous.bam" file, I still get the "[main_samview] fail to read the header from" message. Can you tell me the sequence of commands to do that?
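For readers with the same question: besides samtools view -H, the header can be inspected directly, because a BAM file is a series of gzip-compatible blocks whose decompressed stream starts with the magic bytes "BAM\1", a 4-byte length, and then the plain header text. The following is a minimal sketch using only the Python standard library. The function name and the file name in the usage note are hypothetical, and it is an assumption that the beginning of the affected file is intact enough to decompress.

```python
import gzip
import struct


def read_bam_header_text(path):
    """Return the plain-text SAM header stored at the start of a BAM file."""
    # BGZF is a sequence of gzip members, so the gzip module can decompress it.
    with gzip.open(path, "rb") as handle:
        magic = handle.read(4)
        if magic != b"BAM\x01":
            raise ValueError(f"{path} does not start with the BAM magic bytes")
        # l_text: little-endian int32 giving the length of the header text.
        (l_text,) = struct.unpack("<i", handle.read(4))
        text = handle.read(l_text)
    # The header text may be NUL-padded; strip the padding before decoding.
    return text.rstrip(b"\x00").decode("utf-8", errors="replace")
```

Calling print(read_bam_header_text("sample_exogenous.bam")) would print the @HD/@SQ lines; any line that does not start with "@" would stand out as foreign to the header.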

Thanks again!

ccmeth commented 1 year ago

I ran into a similar issue. The examples provided by the developers worked well for me, but when I processed another dataset and ran "samtools sort _exogenous.bam > tmp.bam", an error occurred: "[bam_sort_core] truncated file. Aborting." This error does not prevent the report file and _human.sorted.bam from being generated, and in my opinion these are the most important files in this step, since they are used in the final step. So I wonder: will the content of report.txt, especially the number of fly reads, be affected by this error?

ccmeth commented 1 year ago

When I ran "samtools view -h *_exogenous.bam > tmp", I did indeed get an error: "[main_samview] truncated file". Could the samtools version be responsible for this issue?
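One check that does not depend on the samtools version: a complete BAM file ends with a fixed 28-byte empty BGZF block (the EOF marker defined in the SAM/BAM specification), and a file missing it is commonly treated as truncated. The sketch below tests only for that marker. Whether a missing marker is the actual cause in this thread is an assumption, and the function name is hypothetical. The samtools quickcheck subcommand performs a similar test.

```python
import os

# The 28-byte empty BGZF block that terminates a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 "
    "1b 00 03 00 00 00 00 00 00 00 00 00"
)


def has_bgzf_eof(path):
    """Return True if the file ends with the BGZF end-of-file marker."""
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as handle:
        # Jump to the last 28 bytes and compare them with the marker.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

If this returns False for the "_exogenous.bam" file but True for the "_human.bam" file, the exogenous file was probably not closed properly when it was written, which would point at the writing step rather than at samtools.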

liguowang commented 1 year ago

One possible reason is that the header section of "*_exogenous.bam" does not match the alignments. This could happen if you did not change the chromosome IDs of the Drosophila genome (the Drosophila genome also has chrX, chrY, and chr4, which are the same as the human IDs).
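To illustrate the name clash described above, here is a small sketch that lists chromosome IDs shared by two genomes and shows that prefixing the exogenous IDs removes the overlap. The name lists are illustrative subsets, and the "dm6_" prefix and both function names are assumptions made for the example, not necessarily what split_bam.py expects.

```python
def shared_reference_names(human_names, exogenous_names):
    """Return the chromosome IDs that appear in both genomes, sorted."""
    return sorted(set(human_names) & set(exogenous_names))


def add_prefix(names, prefix):
    """Return a copy of the IDs with a prefix added to each one."""
    return [prefix + name for name in names]


# Illustrative subsets of chromosome IDs (not complete genome listings).
human = ["chr1", "chr2", "chr3", "chr4", "chrX", "chrY"]
fly = ["chr2L", "chr2R", "chr3L", "chr3R", "chr4", "chrX", "chrY"]

clashes = shared_reference_names(human, fly)
# Hypothetical prefix; any string that makes the fly IDs unique would do.
clashes_after_rename = shared_reference_names(human, add_prefix(fly, "dm6_"))
```

Here clashes is ["chr4", "chrX", "chrY"] and clashes_after_rename is empty, which is the situation the renaming step is meant to guarantee.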

ccmeth commented 1 year ago

split_bam.py could add the dm6 by default. I tried several different versions of samtools and always got the error. When I tried to index the _exogenous.bam file, another error appeared: "samtools index: failed to create index for _exogenous.bam: numerical result out of range".

liguowang commented 1 year ago

Numbers in the report.txt file would NOT be affected by this error. The error is likely caused by an inconsistency between the SAM header and the alignments. In particular, I am guessing that the coordinates of some reads fall outside the chromosome size. But I am still not clear why this happened. Could you please share the header section of your *_exogenous.bam?
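The hypothesis above (alignments inconsistent with the header) can be checked on SAM text, for example on the output of samtools view -h. The sketch below is a hedged illustration rather than the maintainer's method: it collects the @SQ lengths and reports alignments whose reference is missing from the header or whose start position lies beyond the declared length. The function name is hypothetical, and only the start position is checked; computing the alignment end would require parsing the CIGAR string.

```python
def find_out_of_range_alignments(sam_lines):
    """Report alignments that are inconsistent with the @SQ header lines."""
    lengths = {}
    problems = []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            if line.startswith("@SQ"):
                # Header fields look like "SN:chr4" and "LN:1348131".
                tags = dict(
                    field.split(":", 1)
                    for field in line.split("\t")[1:]
                    if ":" in field
                )
                lengths[tags["SN"]] = int(tags["LN"])
            continue
        cols = line.split("\t")
        read_name, ref_name, pos = cols[0], cols[2], int(cols[3])
        if ref_name == "*":
            continue  # unmapped read, nothing to compare
        if ref_name not in lengths:
            problems.append((read_name, ref_name, pos, "reference missing from header"))
        elif pos > lengths[ref_name]:
            problems.append((read_name, ref_name, pos, "start beyond reference length"))
    return problems
```

An empty result does not prove the file is healthy, but a non-empty one would support the header/alignment mismatch explanation.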

ccmeth commented 1 year ago

header.txt

The header is the same as the one from the example you provided. I also wonder whether this issue is related to paired-end sequencing: two independent paired-end datasets both showed a similar error, while a single-end dataset (2016 PLoS One) was fine.

Paired-end dataset 1:

Paired-end dataset 2:

liguowang commented 1 year ago

Hi, I updated split_bam.py (the new version is: https://github.com/liguowang/spiker/blob/main/bin/split_bam.py). I am not totally sure whether this will fix the problem (i.e., failing to index/sort the "*_exogenous.sorted.bam" file). Please try this version and let me know if it works. I will push the new version to PyPI if it does. Thanks a lot.

ccmeth commented 1 year ago

No, the new code still generated the same issue.

zifeng9527 commented 2 months ago

I encountered the same problem.

When I run the code in the following loop, all the "_exogenous.bam" files are wrong:

for i in $(cut -f 2,3,5,6 Sample.info | tr '\t' '\n' | sort -u); do echo "/data/home/001_wzf/000_index/000_pipeline/004_ATACSeq/002_hg19.2Rep/split_bam.py --threads 24 -i "$i"_remChrM.rmdup.bfdownsample.hg19dm6.bam -o "$i"_remChrM.rmdup.bfdownsample.hg19dm6"; done | sort | uniq | xargs -iCMD -P0 bash -c CMD

However, when I run the command on its own, the output BAM is correct:

/data/home/001_wzf/000_index/000_pipeline/004_ATACSeq/002_hg19.2Rep/split_bam.py --threads 24 -i MDA231.shDYL1.sh1.DOXn_Rep1_remChrM.rmdup.bfdownsample.hg19dm6.bam -o MDA231.shDYL1.sh1.DOXn_Rep1_remChrM.rmdup.bfdownsample.hg19dm6

I will try to reduce the threads in the loop.
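The loop above starts every job at once (xargs -P0), each with 24 threads. One way to test whether that concurrency matters is to run the same commands with a cap on simultaneous jobs. This is a minimal sketch under stated assumptions: the function names, default values, and sample prefixes are hypothetical; only the --threads, -i and -o options are taken from the commands in this thread. It is not confirmed that concurrency is the cause.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_split_commands(prefixes, script="split_bam.py", threads=4):
    """Build one split_bam.py command (as an argument list) per sample prefix."""
    return [
        [script, "--threads", str(threads), "-i", f"{prefix}.bam", "-o", prefix]
        for prefix in prefixes
    ]


def run_with_limit(commands, max_jobs=1):
    """Run the commands with at most max_jobs running at the same time."""

    def run(command):
        # check=False: collect the exit code instead of raising on failure.
        return subprocess.run(command, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        # pool.map returns the exit codes in the same order as the commands.
        return list(pool.map(run, commands))
```

With max_jobs=1 the jobs run strictly one after another, which reproduces the "run singly" case; raising it step by step would show at which level of parallelism, if any, the exogenous BAM files start to break.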

liguowang / spiker

split_bam.py fails to sort output exogenous bam file (broken header) #1