nanopore-wgs-consortium / NA12878

Data and analysis for NA12878 genome on nanopore

rel6 version #94

Closed PanZiwei closed 4 years ago

PanZiwei commented 4 years ago

Hi,

I have some questions about the NA12878 dataset version and would really appreciate it if you can help.

  1. I found that the original NBT paper provides the dataset under PRJEB13021 (https://www.ebi.ac.uk/ena/data/view/PRJEB13021) as .tar.gz archives. What files are included in the archives? Both .fast5 and .fastq files?

  2. What is the difference between rel6 of the NA12878 genomic DNA released on the GitHub page (https://github.com/nanopore-wgs-consortium/NA12878/blob/master/Genome.md) and PRJEB13021? My guess is that the original .fast5 files are the same, but the .fastq files differ because they were produced with different base callers?

  3. What is the relationship between the different flow cells? Should I use the data from all flow cells to ensure sufficient read coverage?

  4. The flow cell with ID FAB23716 failed base calling because of an R9.0 version problem. Will it have any negative effect if I only use the remaining flow cells for analysis?

  5. How compatible is NA12878 on R9.4 with other datasets on R9.4.1/R9.5/R10.3? I am interested in detecting DNA modifications in my own data, which was generated on a later pore version rather than R9.4. Can I still use NA12878 on R9.4 as the training set?

  6. Are there any plans to produce NA12878 data with the latest R10.3 version?

Thank you so much for your help!

mattloose commented 4 years ago

Hi,

  1. Probably fast5 and fastq - but I wouldn't use those fast5s as they are stored as a single file per read and are a lot of pain to work with.
  2. The signal is the same, but the fast5 files are repackaged as multi-fast5s, so they are much easier to handle. The fastq is more up to date as it was called with a more recent basecaller.
  3. Flowcells all sequence randomly sheared genomic DNA from NA12878. Some use different methods for library preparation (ligation, rapid and ultra). Each flowcell is a random subset of the genome. The answer to this question depends on what you are trying to do.
  4. No - it won't have any negative effect.
  5. No - different pore types produce different signals.
  6. Not at present though I suspect such data will be available soon.
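
On point 3, since each flowcell is a random subset of the genome, one rough way to decide how many flowcells to include is to estimate the coverage they contribute. The following is only a minimal sketch, not something taken from the release itself: it assumes plain four-line FASTQ records, an approximate human genome size of 3.1 Gb, and the helper names `count_bases` and `coverage_from_bases` are made up for illustration.

```python
import io

# Assumption: approximate human genome size in bases; adjust for your reference.
GENOME_SIZE = 3_100_000_000


def count_bases(handle):
    """Sum sequence lengths in a FASTQ stream with exactly 4 lines per record."""
    total = 0
    for i, line in enumerate(handle):
        # In a 4-line FASTQ record the sequence is the second line.
        if i % 4 == 1:
            total += len(line.strip())
    return total


def coverage_from_bases(total_bases, genome_size=GENOME_SIZE):
    """Estimated mean depth = total sequenced bases / genome size."""
    return total_bases / genome_size


# Tiny in-memory demo with two hypothetical reads (4 and 6 bases long).
demo = io.StringIO("@r1\nACGT\n+\n!!!!\n@r2\nACGTAC\n+\n!!!!!!\n")
demo_bases = count_bases(demo)
demo_coverage = coverage_from_bases(demo_bases)
```

For real data you would open each flowcell's FASTQ file (with `gzip.open(path, "rt")` if it is compressed), add up the `count_bases` results for the flowcells you plan to use, and pass the sum to `coverage_from_bases`. This gives raw sequenced depth only; it ignores unmapped reads and uneven coverage.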