Plasmid reads - Githubissues

erinyoung commented 1 year ago

I would like to contribute to this effort, but I want to make sure that my methods are sound. I would love feedback and insight.

I think I can create a toy dataset for some plasmids containing AMR genes.

Here's my current plan:

1. Assemble nanopore reads into a genome with flye (or do a hybrid assembly with unicycler) to create a closed genome
   - I'd be using existing assemblies, which are mostly Citrobacter and Acinetobacter, but there are some other organisms I could look into if needed
2. Use minimap2 to map nanopore reads to the assembled genome
3. Separate nanopore reads by plasmid
4. Ensure that the nanopore read subset re-assembles into a similar plasmid using flye (and perhaps other assemblers like raven?)
5. Ensure that the nanopore fastq.gz files are "small enough" for GitHub
6. Use minimap2 to map Illumina reads to the assembled genome
7. Separate Illumina reads by plasmid
8. Ensure that the nanopore + Illumina read subset still assembles with unicycler
9. Ensure that the Illumina fastq.gz files are "small enough" for GitHub
10. Add the resulting files to this repo via a PR
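
As a rough illustration of the "separate reads by plasmid" steps, here is a minimal Python sketch. It assumes the minimap2 alignments were written in PAF format (minimap2's default output) and assigns each read to the contig where it has the most matching bases. The function names, the `min_mapq` parameter, and the contig names in the example are hypothetical, and a real workflow may well want stricter filtering than this.

```python
from collections import defaultdict


def assign_reads_to_contigs(paf_lines, min_mapq=0):
    """Map each read name to the contig holding its best alignment.

    PAF columns used (0-based): 0 = read name, 5 = target (contig) name,
    9 = number of matching bases, 11 = mapping quality.
    """
    best = {}  # read name -> (matching bases, contig name)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        read, contig = fields[0], fields[5]
        matches, mapq = int(fields[9]), int(fields[11])
        if mapq < min_mapq:
            continue  # ignore alignments below the (assumed) quality cutoff
        if read not in best or matches > best[read][0]:
            best[read] = (matches, contig)
    return {read: contig for read, (_, contig) in best.items()}


def group_reads_by_contig(assignments):
    """Invert a read -> contig mapping into contig -> set of read names."""
    groups = defaultdict(set)
    for read, contig in assignments.items():
        groups[contig].add(read)
    return dict(groups)


# Made-up PAF records; "chromosome", "plasmid_1" and "plasmid_2" are
# placeholder contig names, not real accessions.
paf = [
    "read1\t1000\t0\t1000\t+\tchromosome\t5000000\t100\t1100\t950\t1000\t60",
    "read2\t800\t0\t800\t-\tplasmid_1\t50000\t10\t810\t780\t800\t60",
    "read2\t800\t0\t300\t+\tchromosome\t5000000\t2000\t2300\t250\t300\t5",
    "read3\t1200\t0\t1200\t+\tplasmid_2\t9000\t0\t1200\t1150\t1200\t60",
]
groups = group_reads_by_contig(assign_reads_to_contigs(paf))
```

The per-contig sets of read names could then be written to text files and used to pull the matching records out of the original fastq.gz files with whatever read-subsetting tool is preferred.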

lskatz commented 1 year ago

This is interesting, thank you! I think any contribution would be appreciated. I haven't yet updated the spec for where we would host the datasets, but I will brainstorm more. For now, I think a dataset with accessions and perhaps AMR results would be the most helpful. Let me know!

erinyoung commented 1 year ago

Don't thank me just yet.

I've attached a file that may be helpful to you.

The file has six columns:

Organism: predicted organism (many of which have changed since submission)
ID: ARLN ID of the isolate in case googling is needed
Illumina SRA: SRA accession of paired-end Illumina reads
nanopore SRA: SRA accession of nanopore reads
Accessions: NCBI genomes accessions of chromosome and plasmids
- These are listed from initial accession to last accession (for example: CP118189-CP118194 actually means CP118189, CP118190, CP118191, CP118192, CP118193, and CP118194)
AMR: potential AMR gene located in sequence
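
For anyone parsing the Accessions column, a range such as CP118189-CP118194 can be expanded into individual accessions with something like the sketch below. This is only an illustration: the function name is hypothetical, and it assumes each range is written as two accessions that share a letter prefix, carry no version suffix, and are joined by a single hyphen.

```python
import re

# Letters (and underscores) followed by digits, e.g. "CP118189".
_ACCESSION = re.compile(r"^([A-Za-z_]+)(\d+)$")


def expand_accession_range(text):
    """Expand 'CP118189-CP118194' into the list of accessions it covers.

    A single accession (no hyphen) is returned as a one-item list.
    """
    text = text.strip()
    if "-" not in text:
        return [text]
    first, last = (part.strip() for part in text.split("-", 1))
    m_first, m_last = _ACCESSION.match(first), _ACCESSION.match(last)
    if not m_first or not m_last or m_first.group(1) != m_last.group(1):
        raise ValueError(f"cannot interpret accession range: {text!r}")
    prefix = m_first.group(1)
    start, end = int(m_first.group(2)), int(m_last.group(2))
    if end < start:
        raise ValueError(f"range runs backwards: {text!r}")
    width = len(m_first.group(2))  # keep any zero padding in the number
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


accessions = expand_accession_range("CP118189-CP118194")
```

Here `accessions` holds the six accessions from the example above, with the first one presumably being the chromosome or a plasmid depending on how the assembly was submitted.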

There are some caveats to this file. It may contain assemblies or SRA accessions that are not yet publicly available. Also, some of these isolates may have their AMR gene on their chromosome as opposed to a plasmid. I wanted to vet these problems first, but I do not think that I'll have the time for that for a while.

I may come back and filter this information in the future, but it's here in case it is useful in the meantime.

LR Seq of ARLN.csv

gbouras13 commented 1 year ago

Hi @erinyoung,

I just came across this issue while looking for more benchmarking datasets for my tool, Plassembler, which implements a good chunk of what you outline :) It doesn't go down to the individual plasmid level, though.

It's still a work in progress, but I thought I would share. Based on your comments, I think I'm going to implement a "--keep fastqs" flag now, so thanks for that, as others may find it useful!

https://github.com/gbouras13/plassembler

George

ncezid-biome / datasets

Plasmid reads #9