ncezid-biome / datasets

Benchmark and toy datasets
MIT License

Plasmid reads #9

Open erinyoung opened 1 year ago

erinyoung commented 1 year ago

I would like to contribute to this effort, but I want to make sure that my methods are sound. I would love feedback and insight.

I think I can create a toy dataset for some plasmids containing AMR genes.

Here's my current plan:

  1. Assemble nanopore reads into a closed genome with flye (or do a hybrid assembly with unicycler)
    • I'd be using existing assemblies, which are mostly Citrobacter and Acinetobacter, but there are some other organisms I could look into if needed
  2. Use minimap2 to map nanopore reads to the assembled genome
  3. Separate nanopore reads by plasmid
  4. Ensure that the nanopore read subset re-assembles into a similar plasmid using flye (and perhaps other assemblers like raven?)
  5. Ensure that the nanopore fastq.gz files are "small enough" for GitHub
  6. Use minimap2 to map Illumina reads to the assembled genome
  7. Separate Illumina reads by plasmid
  8. Ensure that the nanopore + Illumina read subset still assembles with unicycler
  9. Ensure that the Illumina fastq.gz files are "small enough" for GitHub
  10. Add the resulting files to this repo via a PR
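
For illustration, steps 2 and 3 (and likewise 6 and 7) could be sketched roughly as below. This is only a sketch under stated assumptions, not a vetted method. It assumes minimap2 was run with its default PAF output, that the FASTQ files use plain four-line records, and that each read goes to the contig of its best alignment. The function names (`best_hits`, `reads_by_contig`, `write_subset`) and the thresholds (`min_mapq=20`, `min_query_cov=0.8`) are hypothetical choices for this example and would need tuning.

```python
import gzip
from collections import defaultdict


def best_hits(paf_lines, min_mapq=20, min_query_cov=0.8):
    """Pick one target contig per read from minimap2 PAF records.

    PAF columns used (0-based): 0 query name, 1 query length,
    2 query start, 3 query end, 5 target name, 9 matching bases,
    11 mapping quality. Thresholds are illustrative assumptions.
    """
    best = {}  # read id -> (matching bases, contig name)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        read, qlen = fields[0], int(fields[1])
        qstart, qend = int(fields[2]), int(fields[3])
        contig, matches, mapq = fields[5], int(fields[9]), int(fields[11])
        if qlen == 0 or mapq < min_mapq:
            continue
        if (qend - qstart) / qlen < min_query_cov:
            continue  # alignment covers too little of the read
        if read not in best or matches > best[read][0]:
            best[read] = (matches, contig)
    return {read: contig for read, (_, contig) in best.items()}


def reads_by_contig(assignments):
    """Invert {read: contig} into {contig: set of read ids}."""
    groups = defaultdict(set)
    for read, contig in assignments.items():
        groups[contig].add(read)
    return dict(groups)


def write_subset(fastq_gz_in, fastq_gz_out, wanted_ids):
    """Copy four-line FASTQ records whose id is in wanted_ids.

    Returns the number of records written.
    """
    kept = 0
    with gzip.open(fastq_gz_in, "rt") as src, gzip.open(fastq_gz_out, "wt") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:
                break  # end of file
            read_id = record[0][1:].split()[0]  # drop "@" and any description
            if read_id in wanted_ids:
                dst.writelines(record)
                kept += 1
    return kept
```

Two caveats with this approach: reads that span the start/end junction of a circular contig are split into two alignments and may fail the coverage filter, and reads from sequence shared between a plasmid and the chromosome tend to get low mapping quality and be dropped. Both would matter for step 4, since losing those reads could keep a plasmid from closing on re-assembly.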

lskatz commented 1 year ago

This is interesting, thank you! I think any contribution would be appreciated. I haven't yet updated the spec for where we would host the datasets, but I will brainstorm more. For now, I think a dataset with accessions and perhaps AMR results would be the most helpful. Let me know!

erinyoung commented 1 year ago

Don't thank me just yet.

I've attached a file that may be helpful to you.

There are six columns in this file that designate

There are some caveats to this file. It may contain assemblies or SRA accessions that are not yet publicly available. Also, some of these isolates may have their AMR gene on their chromosome as opposed to a plasmid. I wanted to vet these problems first, but I do not think that I'll have the time for that for a while.

I may come back and edit or filter this information in the future, but it's here in case it is useful in the meantime.

LR Seq of ARLN.csv

gbouras13 commented 1 year ago

Hi @erinyoung,

I just came across this issue while looking for more benchmarking datasets for my tool Plassembler, which implements a good chunk of what you outline :) It doesn't go to the individual plasmid level, though.

It's still a work in progress, but I thought I would share. Based on your comments, I think I'm going to implement a "--keep fastqs" flag now, so thanks for that, as others may find it useful!

https://github.com/gbouras13/plassembler

George