karel-brinda / NanoSim-H

NanoSim-H: a simulator of Oxford Nanopore reads; a fork of NanoSim.
Other
17 stars 6 forks source link

Create R9.4 E. coli profile #13

Open mbhall88 opened 5 years ago

mbhall88 commented 5 years ago

Hi @karel-brinda ,

I have just finished training a new profile with some E. coli R9.4 data that was basecalled with the latest Guppy high accuracy algorithm (v3.1.5).

Is there a way of "testing" this? Also, what would be the best way for me to share this with you so you can add it to the default profiles?

karel-brinda commented 5 years ago

Hi Michael, great, thank you so much for digging into this. The easiest way would be to put it here: https://github.com/karel-brinda/NanoSim-H/tree/master/nanosimh/profiles and create a pull request. Do you think I should make it the "default default" profile? :)

mbhall88 commented 5 years ago

I would like to make sure it isn't producing rubbish first. Let me simulate some reads from something and plot it. I'll put it up here and see what you think.

mbhall88 commented 5 years ago

tl;dr There doesn't seem to be any problems with the model.

Training

Ok, so I trained the model on E. coli data basecalled with Guppy (v3.1.5) high accuracy model that was sequenced for the ultra-long read protocol from Loman et al.. The reference I used in the training was Escherichia coli str. K-12 substr. MG1655 (NC_000913.3). This was run with

container="docker://quay.io/biocontainers/nanosim-h:1.1.0.4--pyr351h24bf2e0_1"
reads="loman_k12.fa"
reference="NC_000913.3.fa"
profile_dir="profile/"

singularity exec "$container" nanosim-h-train \
    --infile "$reads" \
    "$reference" \
    "$profile_dir"

This took ~4 days and there was an error at the end saying that nanosim-h couldn't find the model_fitting.R script (I suspect this is something to do with using a bioconda/biocontainers image). So I also ran the following afterwards to do the model fitting.

R CMD BATCH '--args prefix="profile/"' "model_fitting.R"

Testing

To test whether the output is crazy I simulated reads from the TB reference NC_000962.3

ref="NC_000962.3.fa"
container="docker://quay.io/biocontainers/nanosim-h:1.1.0.4--pyr351h24bf2e0_1"
profile="nanosim_training/ecoli/profile"

singularity exec "$container" nanosim-h \
    --circular \
    "$ref" \
    --profile "$profile"

The read length distribution is

read length distribution

Happy for someone to correct me but this doesn't seem too unusual. Especially given the dataset I simulated this sample from.

And if I map the reads to the reference they were simulated from

minimap2 -ax map-ont NC_000962.3.fa simulated.fa | samtools sort -o output.bam

the percent identity looks like percent identity

In case there are questions about how the percent identity it calculated I will put the code below

def sam_percent_identity(filename: str):
    """Opens a SAM/BAM file and extracts the read percent identity for all
    mapped reads that are not supplementary or secondary alignments.
    Args:
        filename: Path to SAM/BAM file.
    Returns:
        A list of the percent identity for all valid reads.
    """
    # get pysam read option depending on whether file is sam or bam
    file_ext = os.path.splitext(filename)[-1]
    read_opt = "rb" if file_ext == ".bam" else "r"

    # open file
    samfile = pysam.AlignmentFile(filename, read_opt)

    perc_identities = []
    for record in samfile:
        # make sure read is mapped, and is not a suppl. or secondary alignment
        if record.is_unmapped or record.is_supplementary or record.is_secondary:
            continue
        pid = get_percent_identity(record)
        if pid:
            perc_identities.append(pid)

    samfile.close()

    return perc_identities

def get_percent_identity(read: pysam.AlignedSegment):
    """Calculates the percent identity of a read based on the NM tag if present
    , if not calculate from MD tag and CIGAR string.
    Args:
        read: A read within a sam file (pysam class).
    Returns:
        The percent identity or None if required fields are not present.
    """
    try:
        return 100 * (1 - read.get_tag("NM") / read.query_alignment_length)
    except KeyError:
        try:
            return 100 * (
                1
                - (_parse_md_flag(read.get_tag("MD")) + _parse_cigar(read.cigartuples))
                / read.query_alignment_length
            )
        except KeyError:
            return None
    except ZeroDivisionError:
        return None

perc_identities = sam_percent_identity("output.bam")
mbhall88 commented 5 years ago

Do you think I should make it the "default default" profile? :)

Given R9.4 is much newer than R9 and not many people really work with R9 now it probably makes sense to make it the default.
However, this is your repository so it is completely up to you.

karel-brinda commented 5 years ago

Hi Michael, thank you very much for the update and all the associated work! It looks like a very good improvement. I can't fully review the PR now (from time reasons), but I will do that as soon I can (probably July) and then I'll also release a new version. Thanks.

karel-brinda commented 5 years ago

Hi Michael,

thank you for the absolutely amazing work you did!!

My only question is the following – do you think the data that the model was trained on, are "typical"? In my experience, data produced by leading experts on nanopore sequencing are often of much higher quality than what a typical user gets. I can think of two reasons:

  1. Much better skills and incomparably more experience
  2. Membership in the early access program, which potentially might allow using different flowcells or reagents than normal users receive

Given your experience, does the simulation well correspond to what you get in your lab (from your comment it seems that yes)?

Btw. I'm thinking how to incorporate your scripts and text in the repository – it will be very useful for others as well.

Thanks again for all the work!

mbhall88 commented 5 years ago

Hi Karel,

I mean obviously, Nic and Josh are leading experts. I think at the time this data was sequenced I would say it was of much higher quality than most people would have been producing. But from being at London Calling recently it seems to me like most people use this protocol now. So this data, whilst not "typical" at the time, would now probably be considered "typical" (or close to). Happy for someone to correct me though as I am dealing a lot with M. tuberculosis data and the data from that organism is much different (read length-wise) than E. coli.

If I am being a harsh judge of my own work I would maybe say it would be nice to have done this with a variety of different samples, but sadly it is difficult to get nanopore data with known truth. Within the next year though I should have a treasure trove of TB data with known truth and I would obviously be very happy to update/add a new model with that.

Very happy for you to include the scripts etc.