nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

FASTQ output (--emit-fastq) is not recognized by some downstream applications #997

Closed DaRinker closed 4 weeks ago

DaRinker commented 4 weeks ago

[QUESTION/REQUEST FOR ADVICE]

We have just started using dorado to basecall, de-multiplex, and trim adaptors from pod5 files.

I am now trying to assemble them using a well-tested pipeline in our group but am getting a frustrating error.

At the root of the problem is that the assembler we like/trust (https://github.com/mikolmogorov/Flye) requires fastq input. However, the assembler keeps choking on the FASTQ files I try to feed it.

To generate the fastq files from dorado I tried two approaches: 1) using --emit-fastq during the adaptor trim step, and 2) using SAMtools to convert the dorado ubam to a fastq (samtools bam2fq).

Both of these fastqs appear (on visual inspection) to be formatted correctly (see below), and they're equivalent in terms of basic sequence/quality information. However, Flye doesn't like them, producing the error "ERROR: Can't identify input file type" when I try to run the assembler. Given that other ONT fastq files (all called by Guppy) assemble as expected, I'm looking for ideas as to what might be going on in our case. Is there anything in the dorado basecalling process that might introduce artifacts into the output that would be hard to detect at a glance?

tl;dr QUESTION: Are there any "quirks" of dorado basecalling that might result in a fastq file not looking like a proper fastq file to some interpreters?

PS: I'm still waiting for the Flye developers to weigh in, but thought it worthwhile to ask from the basecaller side of things as well.

And if it helps, here's the head of one of my problematic fastq files. It looks like a good fastq file to me!:

$ head barcode02.bam_trimmed.fastq
@fcd84ea3-85e5-437a-ba3e-900c894e41c0 st:Z:2024-05-14T17:52:23.517+00:00 RG:Z:7f7854be46aac37b85e749c5c1b729e69ac43456_dna_r10.4.1_e8.2_400bps_hac@v4.3.0_SQK-NBD114-24_barcode02
AAGGTTAAACAGACGACTACAAACGGAATCGACAGCACCTGCACCAACCATACCTAATAATAATATTATTGAACTTATTATTAATCATATTGAATAACTAGTATACATTATGTTTCCTATTCCTGTTATATGAGTAAATTCTACTAATGAACTATCTCAACCTTTACTACTTGCATAAGCTATTTCTTGTTTTAAATCTACTATCTGAAAAAATTCATTATCTAATTGAACATTATATACTTCAGAAAATGAATTTGATAAATAAGAAACAATTGTATTATCTGTTAAATTAGAAGGTAATACTTGTCCTACAATATAGTAGAAAAGTAAAACTGTTAAAACAGCTAAAGGTATATCATTATTAGTTTTCACTTAAAAGTTCAGATATTCTAATATTTATTAACATTAATATGAATAAGAATAAAATTGATACAGCTCCTACATACACTAAAATATAAGATAATCCTATGTAATTATATCCTACTAATATTAATAAACCAGCAATAAGTCATGATTATGGTCTATTAATTACAATTATTCTCCTGGTGGAGTCATGTCAAGCAGCCATTCTGCTCTTACTGCGAATCCGTGCTCCGTAAATATAAAGCAGCAGCGGGATAGGGATCATCAATACTCCCACACATCCAAGCAGTGTGCAGGCCCATTCAACACCTAATCCATCAAACTGTATCATTTCAGTCAGTTTCATCATTTGAAAGCCCAAGGTTTGGTGACTGACCATATACGAGGTGCTGTCGATTCCGTTTGTAGTCGTCTGTTTTAACCTTAGCATACGTATGG
+
BFGJKHECFAHPLHDEBBHSFSKEK98D9898@BFHFGEEFJIGKFNOLSLSFSGSIKKJKSJJSH??@;;;EFCFCDILSSSSFSSSJJSLOSGFD@ACBFJFJEJQSSSS?>SSSLMLMGLOHGHSHSSIIISKLLHFGELISSSFJSFMLKHHISGJGJMCISJFSKHISKLLHNNSJSKFGE@?ADNEPSJFGGHJDE.,'''SJGDFIFILKSGGDDDLSGPRMSSOSNIICLG2222@KKS?>??>HGSJA778SGIEPHCHKEFSEIMHQSAABADBF@9IA??@BILPKBA;::9<=E631158=8899;=@IKC<>CGEG>;,(+DJAIECDFDLIJIIIJGGFNMSMJCABAAC=;<>>77SLSNLKOJINISKMHLNSJSOPLSHGSSSGNC@AAB/-+)+&'+,59JSJFGFIJJGCJIMGSOMSHRMSJSDGDJF>@>ACB9B5FLQLFCBCDKMISRJGD@KGDFCE;3,'*=?BBHHGEIHJSGLEFOIFHFEFRQ8673222=<?DGSFSGGIGEFCEDDLKPSSMLJSMLGB@;@EFFJFFGFEEIFFGBA7;;52+**+00+-))))*0424599978DEELFLISIB@?@F;::9G@<;:6<?>5555D?CC@HFEDKGHSNHSJSFOKOJSGIFIDDFSGCDDCFE<HSFJIKSSKKEILJISIISFIJGDDMGEHRSG;:::LJPJJPHGFJGFIGEHJHFKMMCABDHSNHRGFBBDHPKSHGIE@@?COGLGKSIE666/./+,+)10036662,--/033//--)
@41c3fb3a-1820-407f-ba8d-4bc9a28b883e st:Z:2024-05-14T17:53:59.847+00:00 RG:Z:7f7854be46aac37b85e749c5c1b729e69ac43456_dna_r10.4.1_e8.2_400bps_hac@v4.3.0_SQK-NBD114-24_barcode02
AGTGTTATGTACACTGATTCAGTTACATTGTGCTTTGCTAAGGTTAAACAGACGACTACAAACGGAATCGACAGCACCTATATGTATACAGCCCAGATGGCCCATGCCTAGAGCGCTATCCAGGGGCGCGCGCTGCCGACTGGATATCTAGAGAACCATGAGCAGAATGAGGTGCTGTCGATTCCGTTTGTAGTCGTCTGTTTAACCTTAACAATGGTA
+
***.4)'(&$##%%%%%&&&&'&%##$##$(&%&$)-2465?=AD?>;<BDEFGE@@CIAABBEPDFFFOSILFEDHHFAA565656<?BC<GMICSLSJFJEFOHDHHSIFHFJSPABDJFFDECBDEFNHIFEDDKSHJIOFFB?CCCB?10++,0.+-+,7689AHHCJ=-,,.;8/7987B@?A===AAA;<;8:99;4%#$''(+-
@d81f95f0-66fa-49e7-9c39-403b99c7d5f2 st:Z:2024-05-14T17:52:04.836+00:00 RG:Z:7f7854be46aac37b85e749c5c1b729e69ac43456_dna_r10.4.1_e8.2_400bps_hac@v4.3.0_SQK-NBD114-24_barcode02
AAGGTTAAACAGACGACTACAAACGGAATCGACAGCACCTCAATCAGTCCAGCTGCTGGCCCCAATTGATCTGAATGAATGTAATAAGGAAATGCAATGTATTATTCAGAGAAAGATCAAGAGCAAATACTCTTGCAAGCTAAGATGAGATTGATACAGAGAATCAAGCGTCATCAATCAGCAGGGTCTTCTGCAACTATATACCGTCTTTGCCCCGAGGTGCTGTCGATTCCGTTTGTAGTCATCTGTTTAACCTTAGCGATA
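
Since "looks fine on visual inspection" can hide problems such as truncated records, mismatched sequence/quality lengths, or Windows line endings, a quick structural check may be worth running before suspecting the basecaller. The following is only a minimal sketch, assuming plain 4-line (unwrapped) FASTQ records; the function names check_fastq_lines and check_fastq_file are made up for illustration and are not part of dorado, samtools, or Flye.

```python
import gzip
from itertools import islice
from typing import Iterable, List


def check_fastq_lines(lines: Iterable[str], max_records: int = 1000) -> List[str]:
    """Return human-readable problems found in the first `max_records` records.

    Assumes 4 lines per record (header, sequence, '+', quality).
    """
    problems: List[str] = []
    it = iter(lines)
    for rec_no in range(1, max_records + 1):
        record = list(islice(it, 4))
        if not record:
            break  # clean end of input
        if len(record) < 4:
            problems.append(f"record {rec_no}: truncated ({len(record)} of 4 lines)")
            break
        if any(line.endswith("\r\n") for line in record):
            problems.append(f"record {rec_no}: Windows (CRLF) line endings")
        header, seq, sep, qual = (line.rstrip("\r\n") for line in record)
        if not header.startswith("@"):
            problems.append(f"record {rec_no}: header does not start with '@'")
        if not sep.startswith("+"):
            problems.append(f"record {rec_no}: third line does not start with '+'")
        if len(seq) != len(qual):
            problems.append(
                f"record {rec_no}: sequence length {len(seq)} != quality length {len(qual)}"
            )
    return problems


def check_fastq_file(path: str, max_records: int = 1000) -> List[str]:
    """Open a plain or gzipped FASTQ file and check its first records."""
    opener = gzip.open if path.endswith(".gz") else open
    # newline="" keeps the original line endings so CRLF can be detected.
    with opener(path, "rt", newline="") as handle:
        return check_fastq_lines(handle, max_records)
```

For example, check_fastq_file("barcode02.bam_trimmed.fastq") returns an empty list if the first 1000 records pass these checks. An empty list only means the basic record structure is intact; it says nothing about how a particular tool decides what kind of file it has been given.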

HalfPhoton commented 4 weeks ago

Do you get a specific error?

DaRinker commented 4 weeks ago

Unfortunately, the only error given is that one (Can't identify input file type). And without knowing what the software is keying in on, I'm stymied. Asking for insight here is my long shot...

HalfPhoton commented 4 weeks ago

Searching the Flye codebase for that error string gives a hit in sequence_container.cpp:33

It looks like one (or more) of the files being passed to Flye has a filename which is missing the fasta/fa or fastq/fq extension - specifically after a . character.

Flye code snippet:

bool SequenceContainer::isFasta(const std::string& fileName)
{
    std::string withoutGz = fileName;
    if (fileName.substr(fileName.size() - 3) == ".gz")
    {
        withoutGz = fileName.substr(0, fileName.size() - 3);
    }

    size_t dotPos = withoutGz.rfind(".");
    if (dotPos == std::string::npos)
    {
        throw ParseException("Can't identify input file type");
    }
    std::string suffix = withoutGz.substr(dotPos + 1);

    if (suffix == "fasta" || suffix == "fa")
    {
        return true;
    }
    else if (suffix == "fastq" || suffix == "fq")
    {
        return false;
    }
    throw ParseException("Can't identify input file type");
}
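
To try a filename against the check quoted above without running Flye, the same logic can be sketched in Python. This is a hand-written mirror of the quoted snippet, not Flye code: the name flye_style_is_fasta is made up for illustration, and it raises ValueError where the C++ throws ParseException.

```python
def flye_style_is_fasta(file_name: str) -> bool:
    """Mirror of the quoted check: True for FASTA, False for FASTQ.

    Raises ValueError when the suffix is not recognized.
    """
    # Strip one trailing ".gz", as the quoted code does.
    without_gz = file_name[:-3] if file_name.endswith(".gz") else file_name

    # Only the text after the LAST "." is compared.
    dot_pos = without_gz.rfind(".")
    if dot_pos == -1:
        raise ValueError("Can't identify input file type")
    suffix = without_gz[dot_pos + 1:]

    # The comparison is exact and case-sensitive.
    if suffix in ("fasta", "fa"):
        return True
    if suffix in ("fastq", "fq"):
        return False
    raise ValueError("Can't identify input file type")


if __name__ == "__main__":
    for name in ("barcode02.bam_trimmed.fastq", "reads.fq.gz", "reads.FASTQ", "reads.bam"):
        try:
            kind = "FASTA" if flye_style_is_fasta(name) else "FASTQ"
        except ValueError as err:
            kind = f"rejected ({err})"
        print(f"{name}: {kind}")
```

Under this logic a name such as barcode02.bam_trimmed.fastq is accepted as FASTQ, because only the part after the last "." is examined; extra dots earlier in the name do not matter. Names that would be rejected are those whose final suffix (after removing ".gz") is anything other than fasta, fa, fastq, or fq, including upper-case variants.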

I hope this helps 👍 Best regards, Rich

DaRinker commented 4 weeks ago

Thank you, Rich.

I was hoping this would solve it for me. My file was originally named barcode02.bam_trimmed.fastq, so I changed it to barcode02_bam_trimmed.fastq to remove the other ".", but flye still crashes out with the same error.

I also tried simply renaming my input file to the filename of a previous (guppy-called) fastq file that flye processes without issue. Flye crashed again.

So it seems like the error goes beyond what's in the file name itself...

Oh well... thanks for trying to help.

And, gonna close this since I think it's more a problem with flye.

Really appreciate your insight, though.