isugifNF / polishCLR

A nextflow pipeline for polishing CLR assemblies
https://isugifnf.github.io/polishCLR/

add: smaller dataset #38

j23414 closed this issue 1 year ago

j23414 commented 2 years ago

Add a smaller test dataset, either a small real genome or one of several simulated genome options:

Option 1: Near-ideal case with no repeated sequences anywhere in the genome, e.g. ACGT AACCGGTT AAACCCGGGTTT..., so that short reads do not map to multiple locations

Option 2: Same as option 1, but introduce random errors

Option 3: Same as option 1, but introduce polyploidy
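A minimal sketch of how Options 1 and 2 could be generated, in case it helps the discussion. The function names `make_blocks` and `add_errors`, the output file name, and the block count of 50 are all placeholders, not part of polishCLR. One caveat: once the runs get longer than the read length, reads that fall entirely inside a homopolymer run are identical to each other, so the pattern only stays unambiguous for small block counts.

```shell
#!/usr/bin/env bash
# Hypothetical helper for Option 1: emit A^k C^k G^k T^k for k = 1..n,
# i.e. ACGT AACCGGTT AAACCCGGGTTT ... without the spaces.
make_blocks() {
  local n=$1 k base seq=""
  for ((k = 1; k <= n; k++)); do
    for base in A C G T; do
      # printf pads an empty string to width k; tr turns the spaces into the base
      seq+=$(printf '%*s' "$k" '' | tr ' ' "$base")
    done
  done
  printf '%s\n' "$seq"
}

# Hypothetical helper for Option 2: substitute each base with probability `rate`.
# A fixed seed keeps the simulated errors reproducible.
add_errors() {
  local rate=$1 seed=$2
  awk -v rate="$rate" -v seed="$seed" '
    BEGIN { srand(seed); split("A C G T", alphabet, " ") }
    {
      out = ""
      for (i = 1; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (rand() < rate) {
          # pick a replacement that differs from the original base
          do { r = alphabet[int(rand() * 4) + 1] } while (r == c)
          c = r
        }
        out = out c
      }
      print out
    }'
}

# Write a single-record FASTA file; the header and file name are placeholders
{ echo ">sim_option1"; make_blocks 50; } > sim_option1.fa
```

For Option 2, something like `make_blocks 50 | add_errors 0.01 42` would give the same sequence with roughly 1% substitutions. Option 3 (polyploidy) is not covered by this sketch.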

j23414 commented 2 years ago

Went with a small genome.

Adding some notes here from a debugging session.

Illumina and PacBio reads have been subset into three files that match Hzea Chr 1. Thanks, Ben and Amanda.

Param Files
--illumina_reads "testpolish_{R1,R2}.fq"
--pacbio_reads "test.1.filtered.bam"

These are passed to all three cases:

Case 1

Should only require the addition of the assembly file (or Hzea Chr 1) and a mitochondrial file

 # <== HERE: --primary_assembly, --mitochondrial_assembly
 nextflow run isugifNF/polishCLR -r main \
  --primary_assembly "GCF_022581195.2_ilHelZeax1.1_chr1.fa" \
  --mitochondrial_assembly "GCF_022581195.2_ilHelZeax1.1_mito.fa" \
  --illumina_reads "testpolish_{R1,R2}.fq" \
  --pacbio_reads "test.1.filtered.bam"

Case 2

Should require the Case 1 files, plus the unpolished alternate assembly file (chr 1) from FALCON-Unzip

 # <== HERE: --primary_assembly, --mitochondrial_assembly
 # --alternate_assembly: pull from Hzea from 3-unzip folder
 nextflow run isugifNF/polishCLR -r main \
  --primary_assembly "GCF_022581195.2_ilHelZeax1.1_chr1.fa" \
  --alternate_assembly "data/alternate.fasta" \
  --mitochondrial_assembly "GCF_022581195.2_ilHelZeax1.1_mito.fa" \
  --illumina_reads "testpolish_{R1,R2}.fq" \
  --pacbio_reads "test.1.filtered.bam"

Case 3

Should require the Case 1 files, plus the polished alternate assembly file (chr 1) from FALCON-Unzip

 # <== HERE: --primary_assembly, --mitochondrial_assembly
 # --alternate_assembly: pull from Hzea from 4-polish folder
 nextflow run isugifNF/polishCLR -r main \
  --primary_assembly "GCF_022581195.2_ilHelZeax1.1_chr1.fa" \
  --alternate_assembly "data/alternate.fasta" \
  --mitochondrial_assembly "GCF_022581195.2_ilHelZeax1.1_mito.fa" \
  --illumina_reads "testpolish_{R1,R2}.fq" \
  --pacbio_reads "test.1.filtered.bam"
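Since the three cases differ only in whether an alternate assembly is passed, the shared read arguments could be kept in one place. The sketch below is only an illustration: `common_args` and `polish_case` are hypothetical names, and the function prints the command rather than running it so it can be checked first. The flags themselves are the ones used in the commands above.

```shell
#!/usr/bin/env bash
# Shared read inputs, as listed under "Param Files"
common_args=(
  --illumina_reads "testpolish_{R1,R2}.fq"
  --pacbio_reads "test.1.filtered.bam"
)

# Hypothetical helper: print the nextflow command for one case.
# Usage: polish_case PRIMARY MITO [ALTERNATE]
# Remove the leading `echo` to actually launch the run.
polish_case() {
  local primary=$1 mito=$2 alternate=${3:-}
  local args=(--primary_assembly "$primary" --mitochondrial_assembly "$mito")
  # Cases 2 and 3 add an alternate assembly; Case 1 does not
  if [[ -n $alternate ]]; then
    args+=(--alternate_assembly "$alternate")
  fi
  echo nextflow run isugifNF/polishCLR -r main "${args[@]}" "${common_args[@]}"
}

# Case 1
polish_case "GCF_022581195.2_ilHelZeax1.1_chr1.fa" "GCF_022581195.2_ilHelZeax1.1_mito.fa"
# Cases 2 and 3 (data/alternate.fasta comes from the 3-unzip or 4-polish folder)
polish_case "GCF_022581195.2_ilHelZeax1.1_chr1.fa" "GCF_022581195.2_ilHelZeax1.1_mito.fa" "data/alternate.fasta"
```

Note that `echo` drops the quotes around the read glob, so the printed line is for inspection only. Run without `echo`, the array expansion passes `testpolish_{R1,R2}.fq` to nextflow as a single quoted argument.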

Trio dataset (Optional but out of scope)

Do we want to consider providing a minimal paternal/maternal trio dataset?

j23414 commented 1 year ago

Just saw, thanks!

Astahlke commented 1 year ago

Smaller test datasets have been added to https://data.nal.usda.gov/dataset/data-polishclr-example-input-genome-assemblies

[ NOTE - Data files added 2022-11-01:

j23414 commented 1 year ago

On it, thank you!

j23414 commented 1 year ago

Just an update that I'm running the CI tests in a separate repo before merging. Want to check for data transfer/runtime limits.
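One way to probe a runtime limit before relying on CI is to wrap the run in a time budget locally. This is only a sketch: `run_with_budget` is a hypothetical helper, the `sleep` command stands in for the real pipeline run, and it assumes GNU coreutils `timeout` is available (it is not installed by default on macOS).

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under a time budget and report if it was cut off.
# GNU `timeout` exits with status 124 when the limit is reached.
run_with_budget() {
  local budget=$1
  shift
  timeout "$budget" "$@"
  local status=$?
  if [[ $status -eq 124 ]]; then
    echo "exceeded budget of $budget" >&2
  fi
  return "$status"
}

# Example with a stand-in command and a 1-second budget
run_with_budget 1 sleep 0.1 && echo "finished within budget"
```

For a real check, the stand-in would be replaced by the download step plus the pipeline run, with the budget set to whatever the CI provider allows per job.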

j23414 commented 1 year ago

I'm vetoing running the test data in CI, since the 6 hrs needed to download and run it in GitHub CI would delay testing and merging code.

Nextflow stub test cancelled in 6h 0m 15s

Remaining tasks for this issue include adding test data instructions at the top of

j23414 commented 1 year ago

https://github.com/isugifNF/polishCLR/pull/63

j23414 commented 1 year ago

@Astahlke @Sivanandan can we close this issue?

Sivanandan commented 1 year ago

Yep! I think we can.