refresh-bio / KMC

Fast and frugal disk based k-mer counter

KMC abruptly not working #222

Open jrostudent opened 1 year ago

jrostudent commented 1 year ago

For most of the time I've been using kmc, it has worked with few issues. However, I recently had to start working with .fastq data stored in a directory that requires me to use absolute paths (which I believe to be the root cause of the issue, anyway).

This has caused me to get two kinds of errors, either:

A)

[jrosen5@c005:~/applied_proj/sandbox]$ bin/kmc -k27 -ci50 "/scratch/jrosen5/applied_proj/sandbox/data/PRCRreads/SRR5088929_1.fastq.gz" histogram . -sm

Stage 1: 94%Killed

or B)

Error: unknown exception

marekkokot commented 1 year ago

Hi,

I know there are a few issues, like the irritating unknown exception (which in most cases is in fact not unknown, but wrongly propagated, so the user is seeing this nonsense message). We have this fixed, but not published yet. What is the amount of RAM you have on your machine? Also the -sm should be before the input path, so:

bin/kmc -k27 -sm -ci50 "/scratch/jrosen5/applied_proj/sandbox/data/PRCRreads/SRR5088929_1.fastq.gz" histogram . 

Do you really need this switch? The dataset seems quite small, so KMC will probably not use more than the default 12GB of RAM anyway. I don't think the absolute path could cause these issues: KMC has been used with absolute paths for quite a long time and I have never encountered or heard of any problem arising from an absolute path (but of course I am not saying it is not possible). Anyway, let me know how much RAM you have, or maybe just in case try to run it with a small amount of RAM with -m2 (2GB).

jrostudent commented 1 year ago

Sorry, I didn't mean to close and reopen, just reply. I fixed the issue where it was being killed at 97%; however, I have yet to find what is causing the unknown exception error. What is interesting is that it only throws the unknown exception with one particular file. I am working with fastp for filtering and paired-end merging: all reads that can't be merged are sent to two files reflecting the original files, while the reads that are merged are sent to a third file.

When using kmc on either unmerged file it operates without issue; however, when using it on the file of merged reads it throws the unknown exception error. I used a head -50 command to manually inspect the files for differences in structure, but they appear to be the same. What steps would you suggest I take to solve this? Thank you for your response, by the way!

marekkokot commented 1 year ago

Hi, could you share these files? I will try to reproduce.

jrostudent commented 1 year ago

The unmerged one that works is a 10GB file, and I'd have to retrieve it from the HPC. Would you be OK with me just sharing the merged file that doesn't work? It's only about 67MB.

marekkokot commented 1 year ago

Sure, a smaller file causing issues is even better :)

jrostudent commented 1 year ago

I had to zip it because GitHub said it doesn't support the file type, but it should be in .fastq format when unzipped. Thank you! fastpPE12.zip

marekkokot commented 1 year ago

Thanks! I think I know the reason. Here is the very first record:

@SRR5088818.367 HWI:1:X:1:1101:14228:2641 length=51 merged_51_16
CCTAACTTCAACTCACAGAAGATTGTGGCAAACACCCATTAACTTTTCTACACAACTACCATTTCAA
+SRR5088818.367 HWI:1:X:1:1101:14228:2641 length=51
@@@FFDDEHGFFHJEHIGGIEGAHIHHGGJIGGGIJEHJJJJJIIGCCHIBFIHHHFDHDDDDB?@@

Note that the header of the quality line is different from the header of the sequence line, which I believe is not allowed in the fastq format. I mean the qual header should be either empty (just +) or the same as the sequence header. Wikipedia seems to confirm that (and there are other sources saying the same): https://en.wikipedia.org/wiki/FASTQ_format: "Field 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again." KMC checks whether this condition is met (I am not sure if every sequence and quality header is checked, but some of them certainly are). If it is not met, KMC fails (of course the error message should be different).

I think it's best to keep only the + sign in the quality header line, avoiding redundancy in the data.

In summary, I think KMC behaviour is OK (except error message). Let me know what you think.
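
To make the rule concrete, here is a minimal Python sketch of the check described above. It is not KMC's actual code (that lives in kmc_core/fastq_reader.cpp and may well work differently); the function name find_header_mismatches is made up for illustration, and the sketch assumes plain four-line records with no wrapped sequence or quality lines.

```python
import io


def find_header_mismatches(handle):
    """Yield (record_number, seq_header, qual_header) for every record whose
    '+' line is neither a bare '+' nor a copy of the '@' header.

    Assumption: four-line FASTQ records (no wrapped sequence/quality lines).
    """
    record_number = 0
    while True:
        seq_header = handle.readline()
        if not seq_header:
            break  # end of file
        handle.readline()                # sequence line, not needed here
        qual_header = handle.readline()
        handle.readline()                # quality line, not needed here
        record_number += 1
        # Strip only the line ending so that trailing spaces stay visible.
        seq_header = seq_header.rstrip("\r\n")
        qual_header = qual_header.rstrip("\r\n")
        # Compare everything after the leading '@' / '+' character.
        if qual_header != "+" and qual_header[1:] != seq_header[1:]:
            yield record_number, seq_header, qual_header


# The first record of the attached file, as quoted above.
example = io.StringIO(
    "@SRR5088818.367 HWI:1:X:1:1101:14228:2641 length=51 merged_51_16\n"
    "CCTAACTTCAACTCACAGAAGATTGTGGCAAACACCCATTAACTTTTCTACACAACTACCATTTCAA\n"
    "+SRR5088818.367 HWI:1:X:1:1101:14228:2641 length=51\n"
    "@@@FFDDEHGFFHJEHIGGIEGAHIHHGGJIGGGIJEHJJJJJIIGCCHIBFIHHHFDHDDDDB?@@\n"
)
for number, seq_header, qual_header in find_header_mismatches(example):
    # repr() makes trailing whitespace visible in the report.
    print(number, repr(seq_header), repr(qual_header))
```

For a real file, the same function can be given open(path) or, for a .gz file, gzip.open(path, "rt"). Because only the line ending is stripped, a header that differs by a single trailing space is reported too.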

jrostudent commented 1 year ago

Thank you so much!

jrostudent commented 1 year ago

Hey, I wrote a sed command:

sed -i 's/merged_[0-9]*_[0-9]*//g' "$mergeReadout"

to delete the string that differentiated the two headers. For context, here is a snippet of the problematic .fastq file now.

@SRR5088929.119.1 119 length=51
GAAAGAACATAGTTTTATTTCCGTGAACTATACTTTTTCCCCAGAAGCTCTAATAATTGGCATTAAAAAA
+SRR5088929.119.1 119 length=51
CCCCCGGGGGGGGGGGGGGGGGGGGFGGGFGGGGGEGGGGGGGGGGGGEGFGGGGGGGGGGGGGGCBBCC

As you can see, the two headers are now identical; however, kmc still throws the unknown exception error during processing.

jrostudent commented 1 year ago

Update: I modified the command to include unmerged (basically increases the amount of data in the input file for kmc) and it seemed like there was an improvement because instead of throwing the unknown exception error it actually started stage 1, then threw the following error:

Stage 1: 84%
Stage 1: 85%Error: some error while reading fastq file, please contact authors (kmc_core/fastq_reader.cpp: 844)
Error: Cannot open file histogram.kmc_pre

jrostudent commented 1 year ago

@marekkokot, hey I just wanted to update you on the status of the error:

  1. I took your advice and used a sed command to edit the fastq file so that it matches standard fastq format, by removing the merged__ string from the seq header. Here are the sed command, the gzip that follows, and the kmc command used:

sed -i 's/merged_[0-9]*_[0-9]*//g' "$mergeReadout"

gzip "$mergeReadout"

kmc -k27 -ci50 "$mergeReadout" histogram .

  2. I repeatedly got the error: Stage 1: 84%Error: some error while reading fastq file, please contact authors (kmc_core/fastq_reader.cpp: 844)

  3. So I made a bash script to inspect the region around 84/85% of the file and found it to meet standard fastq format. Unfortunately, after all these adjustments I am still unable

marekkokot commented 1 year ago

It seems you are not removing space before "merged". Try this:

sed -i 's/ merged_[0-9]*_[0-9]*//g' "$mergeReadout"
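
The effect of that leading space is easy to see on the header quoted earlier in this thread. Below is a small Python sketch that applies the same two regular expressions through re.sub instead of sed (the variable names are only for illustration). Without the space in the pattern, the space that preceded merged_51_16 is left at the end of the line, so the sequence header still differs from the quality header by one invisible trailing character.

```python
import re

# Headers of the first record quoted earlier in this thread.
seq_header = "@SRR5088818.367 HWI:1:X:1:1101:14228:2641 length=51 merged_51_16"
qual_header = "+SRR5088818.367 HWI:1:X:1:1101:14228:2641 length=51"

# Pattern without the leading space: the space before "merged" is left behind.
without_space = re.sub(r"merged_[0-9]*_[0-9]*", "", seq_header)
# Pattern with the leading space: the whole " merged_51_16" suffix is removed.
with_space = re.sub(r" merged_[0-9]*_[0-9]*", "", seq_header)

# repr() shows the trailing space that a plain print or head would hide.
print(repr(without_space))
print(repr(with_space))

# Compare everything after the leading '@' / '+' character.
print(without_space[1:] == qual_header[1:])  # False: trailing space remains
print(with_space[1:] == qual_header[1:])     # True
```

This would also explain why the headers looked identical on manual inspection: a trailing space does not show up in ordinary terminal output.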