gmarcais / Jellyfish

A fast multi-threaded k-mer counter

Null characters in text output and missing k-mers #174

Open fluhus opened 3 years ago

fluhus commented 3 years ago

Hello,

While going over Jellyfish's text output, I got some unexpected results. After some debugging, I found that the output contains null characters (\x00), which broke my downstream parsing, and that some k-mers are missing. I am using Jellyfish 2.3.0 on Linux.

Input FASTQ, shown as a quoted string to make sure there are no strange characters lurking:

"@A00806:9:HJHTVDMXX:1:1138:32678:16000 1:N:0:ATGCGCAG+ACTGCATA\nCAAGGAGGAGCTTGCAGACCCCGAGGGACGGGAGTTTCAGGCTGTACGTGACGAACTTAACAAGCACTATGACCGCCTTTCGTTGAAAGACAATTATTCA\n+\n:FFFFFFF:FFFFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\n@A00806:9:HJHTVDMXX:1:1139:31096:26725 1:N:0:ATGCGCAG+ACTGCATA\nGAATAATTGTCTTTCAACGAAAGGCGGTCATAGTGCTTGTTAAGTTCGTCACGTACAGCCTGAAACTCCCGTCCCTCGGGGTCTGCAAGCTCCTCCTTGT\n+\nFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\n"

Command:

jellyfish-linux count -m 32 -s 20 -Q "!" --text -o myfile.jf myfile.fastq

First line of the output, as a quoted string:

"000000735{\"alignment\":8,\"canonical\":false,\"cmdline\":[\"count\",\"-m\",\"32\",\"-s\",\"20\",\"-Q\",\"!\",\"--text\",\"-o\",\"/tmp/amitmit/stupid.jf2\",\"/tmp/amitmit/stupid.fastq\"],\"exe_path\":\"/net/mraid08/export/genie/LabData/Analyses/amitmit/jellyfish-linux\",\"format\":\"text/sorted\",\"hostname\":\"genie40.mcl2.weizmann.ac.il\",\"key_len\":64,\"matrix1\":{\"c\":64,\"columns\":[188,176,78,231,155,110,47,48,156,86,53,120,58,201,42,78,210,10,145,157,2,109,236,226,164,77,165,4,188,141,251,211,37,7,70,89,35,106,226,165,225,40,16,101,68,58,127,36,33,152,179,74,154,132,216,36,146,99,10,5,202,167,224,80],\"identity\":false,\"r\":8},\"max_reprobe\":7,\"pwd\":\"/home/amitmit/Desktop/kmers/queue\",\"reprobes\":[1,1,3,6,10,15,21,28],\"size\":256,\"time\":\"Tue Feb 16 14:59:11 2021\",\"val_len\":7}\x00\x00\x00AGCACTATGACCGCCTTTCGTTGAAAGACAAT 1\n"

Notice the null characters following the {...} part.
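
For anyone hitting the same parsing problem, here is a minimal Python sketch of a reader that tolerates the null characters. It assumes (based only on the sample line above, not on any documented format) that every record ends with `<k-mer> <count>` and that anything before it on the same line, such as the header and the null characters, can be ignored. The function name `parse_jellyfish_text` is made up for this example.

```python
import re

# Assumption: a record is "<k-mer> <count>" at the very end of a line.
# Header JSON and \x00 bytes before it on the same line are skipped.
RECORD = re.compile(rb"([ACGT]+) (\d+)$")


def parse_jellyfish_text(data: bytes) -> dict[str, int]:
    """Return {k-mer: count} from the raw bytes of a --text output file."""
    counts: dict[str, int] = {}
    for line in data.splitlines():
        match = RECORD.search(line)
        if match is None:
            continue  # blank line or a line holding only header data
        kmer = match.group(1).decode("ascii")
        # Sum in case the same k-mer shows up on more than one line.
        counts[kmer] = counts.get(kmer, 0) + int(match.group(2))
    return counts
```

Usage would look like `parse_jellyfish_text(open("myfile.jf", "rb").read())`. Reading the file in binary mode keeps the null bytes from interfering with text decoding.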

K-mers missing from my result:


The null characters seem to appear consistently in different runs on different inputs. The number of null characters varies from run to run.

As for the missing k-mers, could they be getting filtered out? I am not sure if that's something missing in my params or a bug on Jellyfish's side.
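
To check which k-mers are missing, the expected set can be computed independently of Jellyfish. The sketch below is a rough cross-check, not a reimplementation of Jellyfish: it assumes plain 4-line FASTQ records, non-canonical counting (no `-C`), that k-mers containing anything other than A, C, G, T are skipped, and that `-Q "!"` filters nothing. The helper names `expected_kmers` and `missing_kmers` are invented for this example.

```python
from collections import Counter


def expected_kmers(fastq_text: str, k: int = 32) -> Counter:
    """Count every k-mer in the sequence lines of a 4-line-per-record FASTQ."""
    counts: Counter = Counter()
    lines = fastq_text.splitlines()
    # Assumption: the sequence is the 2nd line of each 4-line record.
    for seq in lines[1::4]:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: k-mers with non-ACGT bases are not counted.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def missing_kmers(expected, observed) -> list[str]:
    """Return the k-mers present in `expected` but absent from `observed`."""
    return sorted(set(expected) - set(observed))
```

Comparing `expected_kmers(open("myfile.fastq").read())` against the k-mers parsed from the text output gives the list of k-mers that Jellyfish did not report.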