lindenb / jvarkit

Java utilities for Bioinformatics
https://jvarkit.readthedocs.io/

sam2tsv makes extraordinarily large files #161

Closed jasvinderahuja closed 4 years ago

jasvinderahuja commented 4 years ago

Verify

Java version: 1.8.0_262

Subject of the issue

It works on some files but not on others. For most files, sam2tsv produces extraordinarily large output (> 100G). I am using PacBio reads and have removed unmapped reads. Peculiarly, the output shows READ-POS0=., READ-BASE=., READ-QUAL=., REF-POS1=. and CIGAR-OP=H, and yet it still goes on...

Your environment

Steps to reproduce

I have shared the files at this link: https://www.dropbox.com/sh/1ermle431f47cze/AADNlOZtsIfZN0JpCp5tG0Wca?dl=0

cmd: java -jar /home/ahujajs/modulefiles/jvarkit/dist/sam2tsv.jar -R STE50toFUS1_S4921.fasta PCR_571_gatk.bam > PCR_571_gatk.sam2tsv.out

Expected behaviour

tsv file

Actual behaviour

It makes a huge (>100G) tsv file and goes on calling differences even when the alignments have run off.

lindenb commented 4 years ago

Hi, I recently changed the code. It looked fine to me, but maybe there is a bug.

Your link to dropbox returns 404.

jasvinderahuja commented 4 years ago

Thank you for the script, it is a lifesaver: so simple and effective! I see the headers have changed and become even more informative. My error was due to an unexpected FASTA file, with some reads spanning the interval more than once. It gave me the opportunity to learn pysam, which alerted me to the error. You may find it useful to add support for such files, or to give a more informative error in such cases.
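
For anyone hitting a similar problem, a quick sanity check is to confirm that the FASTA passed with -R matches the reference the BAM was aligned against. The thread does not say which check actually caught the error, so the following is only a minimal sketch under that assumption. It uses plain Python rather than pysam, and the function names (`fasta_lengths`, `sam_header_lengths`, `reference_mismatches`) are hypothetical helpers, not part of jvarkit or pysam.

```python
def fasta_lengths(fasta_text):
    """Return {sequence_name: length} for FASTA-formatted text."""
    lengths = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def sam_header_lengths(header_text):
    """Return {sequence_name: length} from the @SQ lines of a SAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields look like TAG:VALUE and are tab-separated.
        fields = dict(
            f.split(":", 1) for f in line.split("\t")[1:] if ":" in f
        )
        if "SN" in fields and "LN" in fields:
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def reference_mismatches(fasta_text, header_text):
    """List sequences whose header entry disagrees with the FASTA."""
    fasta = fasta_lengths(fasta_text)
    header = sam_header_lengths(header_text)
    problems = []
    for name, length in sorted(header.items()):
        if name not in fasta:
            problems.append(f"{name}: in BAM header but missing from FASTA")
        elif fasta[name] != length:
            problems.append(
                f"{name}: header length {length}, FASTA length {fasta[name]}"
            )
    return problems
```

The header text can be obtained with `samtools view -H PCR_571_gatk.bam`. An empty list from `reference_mismatches` only means that names and lengths agree; it does not prove the sequences themselves are identical.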