Adding an 'N' to no coverage positions of the consensus

mnebroski commented 3 years ago

It's been noted that some segments that are not complete are having the gaps removed from the consensus to create a continuous sequence. From the IRMA paper it looks like by default there is deletion editing occurring to remove gaps from the sequence:

In addition to reference elongation, deletion editing occurs if the consensus allele is in a deleted state. One can optionally delete by ambiguation—replacing the reference base with an ‘N’ nucleotide—to retain the same reference size. This option is helpful when using BLAT in the align step instead of SAM.

Would it be possible to adjust the workflow to by default include an 'N' in the consensus at positions of no coverage?

peterk87 commented 3 years ago

It looks like a custom config file would need to be installed for IRMA to set the parameters DEL_TYPE="NNN" and ALIGN_PROG="BLAT". Unfortunately, it doesn't look like these can be set from the command-line (e.g. DEL_TYPE="NNN" IRMA FLU reads sample or specifying a path to a config file).

It looks like I'll have to update the IRMA Singularity image adding config files with the desired parameters set. In that case, I might as well update IRMA too.

peterk87 commented 3 years ago

Hi @mnebroski

I've just created a version 2.0.0 release for the workflow, which updates IRMA to version 1.0.2, adds "deletion by ambiguation" (DEL_TYPE="NNN") by default for IRMA. The default IRMA FLU module is FLU-utr which might help resolve the UTR regions a bit better. More info in the changelog

Please note that a samplesheet is now required as input rather than a quoted path to some reads. More info in the usage docs and a script to make it easier to create a samplesheet from a FASTQ directory.

If you get a chance, let me know if this new release resolves this issue or if further changes are needed.

mnebroski commented 3 years ago

Hi @peterk87

I reran the last couple of runs with the updated workflow, and on the one run it worked great! It seemed to solve the issue where the NA subtype prediction was called differently than the most abundant NA read count we saw in the one sample. On the second run I got the error:

`Oops... Pipeline execution stopped with the following message: WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.

whoami: unknown uid 1000 cat: can't open 'sample/amended_consensus/*.fa': No such file or directory`

The sample was a blank sample so there weren't any assembled segments expected, but it seems that the rest of the workflow wouldn't continue with this error.

peterk87 commented 3 years ago

Thanks for letting me know about that issue. I'll fix it in a patch release ASAP

peterk87 commented 3 years ago

Hi @mnebroski

The issue you encountered with the blank sample should be resolved with release 2.0.1 (fixed in #6). I've added blank/empty FASTQ files to the workflow test dataset to make sure there's no more regressions in the future.

mnebroski commented 3 years ago

Hi @peterk87

I pulled the latest release and reran the samples, which did resolve the blank sample issue! However the samples that we were expecting to see N's in the consensus were still shorter than the expected segment length and did not contain the N's. It looks like the default for keeping the reference deletions is set to true, but is there another parameter that I'm missing? The command I ran was:

nextflow run peterk87/nf-iav-illumina --input samplesheet.csv -profile docker

peterk87 commented 3 years ago

Hi @mnebroski

I think I'd need more information and some example data/results to see what you mean. Are you talking about internal gaps being removed? Or are you talking about the ends which don't have any reads mapping?

FYI IRMA uses its own set of reference sequences for the various segments and subtypes for performing assembly (flu-amd/IRMA_RES/modules/FLU/reference/consensus.fasta). I'm not sure what impact these reference sequences have on the final consensus sequences generated.

mnebroski commented 3 years ago

Hi @peterk87

It looks like internal gaps are being removed. When we Blast the consensus sequence the results are broken up into several smaller alignments instead of one continuous alignment. The expected length for the segment is also shorter than expected (for example segment 1 should be ~2341nt and the consensus sequence generated is ~1500nt). I can provide you with the specific samples that we're seeing this in if that will help.

Oh okay that's interesting and good to know! Is there a way for us to edit this file to provide our own sequences or could that potentially cause problems with the way the IRMA analysis is performed?

peterk87 commented 3 years ago

Yes, if you could link/pass on the specific reads and IRMA output directories where that is happening, then I could do some troubleshooting and see if it's just a matter of adjusting some parameters.

Is there a way for us to edit this file to provide our own sequences or could that potentially cause problems with the way the IRMA analysis is performed?

We could try creating a new module using the FLU module as a template, however, the process seems quite involved:

https://wonder.cdc.gov/amd/flu/irma/modules.html

There's also a FLU_AD module which may be worth trying out before trying to create a new IRMA module.

peterk87 / nf-flu

Adding an 'N' to no coverage positions of the consensus #3