Use reference sequences from GRCh38 instead of all_chr.fa

nmdp-bioinformatics / pipeline

Consensus assembly and allele interpretation pipeline.

GNU Lesser General Public License v3.0


Use reference sequences from GRCh38 instead of all_chr.fa #61

Closed: ghost closed this issue 9 years ago

ghost commented 9 years ago

Consider using reference sequences directly from GRCh38. The FTP directory

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/

provides a full analysis set as a gzipped FASTA file, a no_alt analysis set as a gzipped FASTA file, and tar-gzipped bwa indices.

See also issue #60

ghost commented 9 years ago

@ckennedy-nmdp @janders3-nmdp @mthorsen-nmdp @ybolon-nmdp what do you think about this suggestion?

ybolon-nmdp commented 9 years ago

Okay by me.

ckennedy-nmdp commented 9 years ago

In theory I like it, but we're gonna hit the reference with a lot of I/O. This might present a network problem.

ghost commented 9 years ago

To clarify, I'm suggesting we pull the sequences and their indices from the FTP link above, instead of using whatever local process we have now to index the reference data and push it to the AWS instances.
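For illustration only, here is a minimal Python sketch of what pulling the files could look like. The function names `classify` and `fetch`, the suffix patterns, and the destination argument are all hypothetical; the actual file names should be taken from the directory listing rather than from this sketch.

```python
import tarfile
import urllib.request
from pathlib import Path

# Directory quoted earlier in this issue.
BASE_URL = (
    "ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/"
    "vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/"
)


def classify(names):
    """Split a directory listing into gzipped FASTA files and bwa index tarballs.

    The suffix patterns are assumptions about how the files are named.
    """
    fasta = [n for n in names if n.endswith((".fna.gz", ".fa.gz", ".fasta.gz"))]
    bwa = [n for n in names if n.endswith(".tar.gz") and "bwa" in n.lower()]
    return {"fasta": sorted(fasta), "bwa_index": sorted(bwa)}


def fetch(names, dest, base_url=BASE_URL):
    """Download the selected files into dest and unpack any bwa index tarballs."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    wanted = classify(names)
    for name in wanted["fasta"] + wanted["bwa_index"]:
        target = dest / name
        if not target.exists():  # skip files already downloaded
            urllib.request.urlretrieve(base_url + name, str(target))
        if name in wanted["bwa_index"]:
            with tarfile.open(target, "r:gz") as archive:
                archive.extractall(dest)
    return wanted
```

Each instance would call `fetch` once at start-up with the file names it needs, so the prebuilt indices replace the local index-and-push step.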

ckennedy-nmdp commented 9 years ago

Ah, got it. Agreed then. I'll set up a time for @mthorsen-nmdp and me to get this done.

janders3-nmdp commented 9 years ago

I approve of this approach.

ckennedy-nmdp commented 9 years ago

@janders3-nmdp, @mthorsen-nmdp downloaded the reference and index files yesterday (I believe) -- can we put those in data/reference? Hopefully this will be the first smoke-test of the pipeline -- assuming I can get to that today.
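Before a smoke test, it may help to confirm that the reference directory is complete. The sketch below is only an assumption about how such a check could look: `missing_index_files` is a hypothetical helper, and it assumes the FASTA files are decompressed and that the index files use the FASTA file name as a prefix, as `bwa index` does by default. The five suffixes are the files `bwa index` writes.

```python
from pathlib import Path

# Files written by `bwa index` next to the FASTA file.
BWA_INDEX_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")
# Assumed FASTA extensions; adjust to match the downloaded files.
FASTA_SUFFIXES = (".fa", ".fna", ".fasta")


def missing_index_files(reference_dir):
    """Map each FASTA file name to the bwa index files it is missing."""
    reference_dir = Path(reference_dir)
    present = {p.name for p in reference_dir.iterdir() if p.is_file()}
    problems = {}
    for name in sorted(present):
        if name.endswith(FASTA_SUFFIXES):
            missing = [name + s for s in BWA_INDEX_SUFFIXES if name + s not in present]
            if missing:
                problems[name] = missing
    return problems
```

An empty result from `missing_index_files("data/reference")` would mean every FASTA file there has a full set of index files.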

ghost commented 9 years ago

Can we close this issue now?

ckennedy-nmdp commented 9 years ago

Yes

ghost commented 9 years ago

Fixed by pull requests #102 and #103.