Closed nschcolnicov closed 6 months ago
Example of the issue: the CSVs point to local files and cannot be used for testing, e.g.
/groups/dog/llenezet/test-datasets/data/panel/21/panel_2020-08-05_chr21.phased.vcf.gz
Sorry, I haven't implemented a full-size test yet, as I first needed a reliable test dataset. We should look at how other pipelines handle this to find out where large files are stored.
Hi,
The command nextflow run main.nf -profile test,singularity --outdir results
should now work without any problems.
Hi @LouisLeNezet, here are some ideas for full-sized datasets. I already use the 1000G S3 files in the quilt pipeline.
Reference panel CSV: 1000G VCFs in S3 buckets: https://github.com/atrigila/quilt_nextflow/blob/master/assets/samplesheet_reference_full.csv
Sample: The downsampled versions at https://s3.amazonaws.com/gatk-test-data/gatk-test-data-readme.html.
Some other options for the sample could be:
VCF: s3://giab/release/NA12878_HG0001/latest/GRCh38/HG0001_NA12878_1_22_v4.2.1_benchmark.vcf.gz
BAM: https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/NA12878/alignment.index.NA12878_TruSeq_Exome_Nebraska_GRCh37_09252015
Genome fasta: AWS iGenomes
The fasta is ready, and so is the reference panel with PR #18. For the sample, NA12878 is easily accessible, but the problem is that this individual and its parents are also present in the reference panel. A full test would therefore require duplicating the huge panel files with these individuals removed, so that the imputation performance is not overestimated. The best option would be a high-coverage BAM file from an unrelated individual outside the 1000 Genomes Project. The GATK resources seem interesting, but only the NA12878 individual is available...
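If no unrelated sample turns up, removing the trio from a copy of the panel could be sketched as below. This is only an illustration, not something the pipeline does today: the helper name and file names are hypothetical, and the parent IDs NA12891 and NA12892 are an assumption that should be checked against the 1000 Genomes pedigree before use. The helper only builds a bcftools view command (the ^ prefix on the sample list means "exclude"); it does not run it.

```python
import shlex


def exclusion_command(panel_vcf, output_vcf, samples_to_drop):
    """Build a `bcftools view` command that writes a copy of the panel
    without the given samples (hypothetical helper, for illustration)."""
    # A leading "^" tells bcftools to exclude the listed samples.
    sample_list = "^" + ",".join(samples_to_drop)
    return [
        "bcftools", "view",
        "--samples", sample_list,
        "--output-type", "z",  # bgzip-compressed VCF
        "--output", output_vcf,
        panel_vcf,
    ]


if __name__ == "__main__":
    # Assumed trio: NA12878 and its presumed parents; verify before use.
    trio = ["NA12878", "NA12891", "NA12892"]
    # Placeholder file names, not files from the pipeline.
    cmd = exclusion_command("panel_chr21.vcf.gz", "panel_chr21.notrio.vcf.gz", trio)
    print(shlex.join(cmd))
```

The resulting file would still need to be indexed and hosted somewhere remote for a full test profile to use it.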
Description of the bug
Test profiles currently contain local paths, e.g. test_full.config.
The test profiles that need correcting are: test_full.config, test_panelprep.config, test_sim.config and test.config.
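A quick way to find the remaining local paths might look like the following. It is a minimal sketch, not part of the pipeline: the function name and the separator set are assumptions, and it simply flags any token that starts with a single "/" in the files passed on the command line (config files or the CSVs they reference). Remote locations such as https:// or s3:// URLs are not flagged.

```python
import re
import sys

# Separators commonly found in samplesheet CSVs and Nextflow config files
# (an assumed set; extend it if the files use other delimiters).
_SEPARATORS = re.compile(r"[\s,;=\"'()\[\]]+")


def find_local_paths(text):
    """Return tokens that look like absolute local paths, in order of appearance."""
    hits = []
    for token in _SEPARATORS.split(text):
        # Remote locations (https://, s3://, ...) never start with a single "/".
        # Tokens starting with "//" are skipped so config comments are ignored.
        if (
            token.startswith("/")
            and not token.startswith("//")
            and token.count("/") >= 2
        ):
            hits.append(token)
    return hits


if __name__ == "__main__":
    # Usage: python find_local_paths.py FILE [FILE ...]
    for file_name in sys.argv[1:]:
        with open(file_name) as handle:
            for hit in find_local_paths(handle.read()):
                print(f"{file_name}: {hit}")
```

Any path it reports would need to be replaced by a remote URL before the profile can run outside the original cluster.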
Command used and terminal output
No response
Relevant files
No response
System information
dev