Closed sean-workman closed 1 year ago
Thank you for providing great details.
We have found an error related to how query files end, and it produced similar logs.
Could you send us the results of `tail -n 8 OCH16_1.fq > OCH16_1_8.fq` and `tail -n 8 OCH16_2.fq > OCH16_2_8.fq`?
Then we will check whether you are facing the same error.
Please find the output of the two tail commands below:
tail -n 8 OCH16_1.fq > OCH16_1_8.fq
@NOVASEQ1:462:HJ2CNDSX5:4:2678:30083:37059 1:N:0:CGGTTACG+CTATAGTC
TTCCCAAGCAGACTAAGCAGAAAAGAGACAGAGAGCCAAGAGAGGAAGAGGGCATAAATTACCAATATCAGAAATGAAAGGGACATTCCTACAGATCCTACAGATATTAAGCGGGTAACAAAGCACTATAAGGAACTGAATGCCAAG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFF
@NOVASEQ1:462:HJ2CNDSX5:4:2678:31584:37059 1:N:0:CGGTTACG+CTATAGTC
GGGTATAGGCAAATGAGAAACAGTGCTCTGTTATAGTTACTAGGTATTAAAAATAAACTTGACCAAGGCTAACGCTGTCTCTTATACACATCTCCGAGCCCACGAGACCGGTTACGGCAACGCGTATGCCGGCGTCGGCTGGAAAAGGGG
+
FFFFF:F:FF:FF,,F:FFFF,FF:F:F,:,FF,F,FFF::F,FFF:FFFFFF,FFF:,F,FFFFF,:FFFF,:FFFF:,F:FFFF:FFF,,,:F,F:FFF,F,F,FFF:,FFFF,FF:,F,,F,FF,,F:,,,F,:,F,,,,,,:,,FF
tail -n 8 OCH16_2.fq > OCH16_2_8.fq
@NOVASEQ1:462:HJ2CNDSX5:4:2678:30083:37059 2:N:0:CGGTTACG+CTATAGTC
TCTTTGTATGTCAGTTTTGGTAGCTTGTGTTTGTGAAAAATTTGTCTGTTTCATCTACATTTTCTCTTGGCATTCAGTTCCTTATAGTGCTTTGTTACCCGCTTAATATCTGTAGGATCTGTAGGAATGTCCCTTTCATTTCTGATATTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFF
@NOVASEQ1:462:HJ2CNDSX5:4:2678:31584:37059 2:N:0:CGGTTACG+CTATAGTC
CGTTAGCCTTGGTCAAGTTTATTTTTAATACCTAGTAACTATAACAGAGCACAGTTTCTCATTTGCCTATACCCCCGTCTCTTATACACATCTGACGCTGCCGACGACTATAGTCTTGTGTAGATCTCGGTGGTCGCCGTATCATTAAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFF:
I have a set of files from the same run that have all behaved the same way, but this one is still fairly large. I am decompressing the smallest pair right now to ensure that I get the same behaviour and could likely share that. I just want to double check with my PI before I share complete sets of raw unpublished data around the globe. :)
Thanks! Could you attach `OCH16_1_8.fq` and `OCH16_2_8.fq` here?
I want to check if EOF is located right after the last character.
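A minimal sketch of this kind of check in Python, for anyone who wants to inspect their own files before uploading. It only reports the last few bytes of a file and whether the file ends with a newline; the function name `describe_file_ending` is made up for illustration, and this is not necessarily the check that was run on the attached files.

```python
import os


def describe_file_ending(path, n=8):
    """Return (last_bytes, ends_with_newline) for the file at `path`.

    `n` is how many trailing bytes to return for inspection.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as fh:
        # Jump to at most n bytes before the end and read the remainder.
        fh.seek(max(0, size - n))
        tail = fh.read()
    return tail, tail.endswith(b"\n")
```

Calling `describe_file_ending("OCH16_1_8.fq")` returns the trailing bytes, so a missing final newline or stray characters after the last quality string are easy to spot.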
Ah sorry about that! Here they are.
I've gotten the go ahead to share the input files as well - what would be easiest to get them to you? R1/R2 are about 5GB each.
Thank you! I have checked the two files, and both end properly.
For sharing the files, we can use any method you are familiar with.
However, before sharing the whole files, if you can reproduce the same error with small subsets, I think you can just upload those here.
So, could you run `head -n 80000 OCH16_1.fq > OCH16_1_80000.fq` and `head -n 80000 OCH16_2.fq > OCH16_2_80000.fq`, and then test with the two small query files?
Here they are! I hope this helps. Please let me know if there is anything else I can do on my end.
Thanks again! I was able to reproduce the segmentation fault. Let me inspect the error over the weekend. I think you have provided everything I need to solve the problem! So, please just wait for me :)
Great to hear! Good luck, I look forward to trying out Metabuli once the bugs are fixed. :)
With your help, I was able to find the problem!
A very short query sequence was causing it.
There are sequences with a length of about 20 in `OCH16_2_80000.fq.txt` (one case is on line 72346), and I found a problem in the function that filters out such cases.
I'll fix it soon and post an updated version.
Thank you again :)
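For anyone who wants to look for such reads in their own data, here is a rough Python sketch that lists reads below a length cutoff together with the line number of the sequence line. It assumes a well-formed FASTQ file with exactly four lines per record; the function name and the default cutoff of 30 are illustrative assumptions, not Metabuli's actual threshold.

```python
def find_short_reads(path, min_len=30):
    """Yield (line_number, read_id, length) for reads shorter than `min_len`.

    Assumes four lines per FASTQ record: header, sequence, '+', quality.
    `line_number` is the 1-based line of the sequence itself.
    """
    read_id = None
    with open(path) as fh:
        for line_no, line in enumerate(fh, start=1):
            if line_no % 4 == 1:
                # Header line of the record.
                read_id = line.rstrip("\r\n")
            elif line_no % 4 == 2:
                # Sequence line of the record.
                seq = line.rstrip("\r\n")
                if len(seq) < min_len:
                    yield line_no, read_id, len(seq)
```

Running `for hit in find_short_reads("OCH16_2_80000.fq"): print(hit)` prints one tuple per short read, so the reported line numbers can be compared directly with the file.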
I think I solved the issue related to reads that are too short to perform a six-frame translation. Please compile the latest Metabuli and test it.
You may see `overflow!!!` in the printed log when you test Metabuli with the full `OCH16_1_val_1.fq` and `OCH16_2_val_2.fq`.
Hi there,
I am running this now (did not get a chance last week) and I am indeed seeing `overflow!!!` in the printed log. I can't imagine an overflow is a good thing, but it sounds expected at least! I had to restart the run because my allocated resources on the cluster I'm using were going to expire before the job finished, but it is going again now and I will keep you informed about how it goes! :)
I'm now running with some reads that were not adapter trimmed and I see no `overflow!!!`, which I think is the expected behaviour.
I am wondering if the alignment used in Metabuli is sensitive to adapter content or not?
Metabuli extracts k-mers from the whole region of query reads. So, k-mers from the adaptors are also extracted and compared to reference k-mers. If matches are found between them, the adaptor sequence region can affect the classification. Thus, it is recommended to trim your sequences before running Metabuli :)
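To illustrate the point, here is a toy Python sketch. It uses plain nucleotide k-mers rather than Metabuli's metamers, and the read, the adapter position, and `k` are all made-up values, so treat it as an illustration of the idea only.

```python
def kmers_with_positions(seq, k):
    """Return (start, kmer) for every k-mer across the whole sequence."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def count_adapter_kmers(seq, adapter_start, k):
    """Count k-mers that overlap the region beginning at `adapter_start`."""
    return sum(
        1
        for start, _ in kmers_with_positions(seq, k)
        if start + k > adapter_start
    )


# Hypothetical read: 8 bases of insert followed by 4 bases of adapter.
read = "ACGTACGT" + "TTTT"
total = len(kmers_with_positions(read, 4))
from_adapter = count_adapter_kmers(read, adapter_start=8, k=4)
print(f"{from_adapter} of {total} k-mers overlap the adapter region")
```

Because every window over the read is extracted, any k-mer that touches untrimmed adapter bases is compared against the reference as well.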
About the `overflow!!!`:
The `overflow!!!` signal arises when there are too many matches between query and reference metamers.
Such a situation occurs due to low-complexity sequences.
I checked `OCH16_1_80000.fq.txt` and found that some reads have very low complexity.
Here are some examples:
@NOVASEQ1:462:HJ2CNDSX5:1:1101:31656:35180 1:N:0:CGGTTACG+CTATAGTC
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFF,FFF::FF:F::F,FFF,FFFFFFFFFF,::F:F,:::F:F
@NOVASEQ1:462:HJ2CNDSX5:1:1101:20356:2550 1:N:0:CGGTTACG+CTATAGTC
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGGGGTTGTGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF::FFFFF,::F:FFFFFF::F::,,FFFFFF,F:,F,,:,,F::,,,:,,,,,,:,,,,,,,,,,,F
@NOVASEQ1:462:HJ2CNDSX5:1:1101:24849:35227 1:N:0:CGGTTACG+CTATAGTC
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTAAAAAAAAAAACACCCCCCCCCCCCGAGAAAAAAAAAAAAGTGTGAAGGAATGGGGTGAAAGAATAGGTGGGGGGGGGGGGGGGGGGGGGGGGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:,,,,,,,,,FFFFFF,,:,:,:::,,,,,,,,,,,,:,,,,,,,,,,,F,,,,,,,,,,,,,,,,,,,,::,,,,,,F:,,FFFFFFFFFFFFFFFFFFF:
I think you can avoid the overflow signal if you remove the low-complexity sequences.
However, Metabuli should be able to handle such cases instead of just giving the `overflow!!!` message.
So, we will update Metabuli in that way soon.
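Until then, low-complexity reads can be removed before classification. The Python sketch below uses one simple heuristic, the fraction of adjacent bases that are identical. The function names and the 0.6 cutoff are assumptions chosen for illustration, not values taken from Metabuli, and a dedicated read-filtering tool may be a better choice for real data.

```python
def repeat_fraction(seq):
    """Fraction of adjacent base pairs that are identical.

    A pure homopolymer such as 'AAAA' scores 1.0; 'ACGT' scores 0.0.
    """
    if len(seq) < 2:
        return 0.0
    same = sum(1 for a, b in zip(seq, seq[1:]) if a == b)
    return same / (len(seq) - 1)


def filter_low_complexity(records, max_repeat_fraction=0.6):
    """Keep (header, seq, qual) records at or below the repeat cutoff."""
    return [
        rec for rec in records
        if repeat_fraction(rec[1]) <= max_repeat_fraction
    ]
```

For paired-end data, both mates of a pair would need to be dropped together so that the two files stay in sync; this sketch does not handle that.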
Let me close this issue because the segmentation fault is solved, and I will open another issue for the `overflow!!!` signal.
Thank you so much for testing our tool! It is helping us a lot 👍
Hi there,
As with #10 I am experiencing segmentation faults at the stage of "Extracting query metamers ...".
I am getting these errors whether I build from source with:
I ran the command:
metabuli classify OCH16_1_val_1.fq OCH16_2_val_2.fq /home/sdwork/scratch/metagenomics/gtdb fq_och16 fq_och16 --threads 32
When I tried looking at the core dump with gdb I saw:
`Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00000000004568d3 in SeqIterator::fillQueryKmerBuffer(char const*, int, QueryKmerBuffer&, unsigned long&, unsigned int, unsigned int) ()`
I tried just using a pre-compiled binary on the cluster and saw the same error.
I tried downloading/installing using conda on one of our local machines and I encounter the exact same problem. I tried changing the permissions as was suggested in #10 and I see the same issues. I downloaded GTDB database locally with:
metabuli databases GTDB207 gtdb tmp
and am trying to run the command:
metabuli classify OCH16_1.fq OCH16_2.fq gtdb och16_out och16 --threads 14 --max-ram 50
The output I see is: