joshuagryphon / plastid

Position-wise analysis of sequencing and genomics data
https://plastid.readthedocs.io
Other
36 stars 16 forks source link

Crossmap fails on hg38 unplaced and unlocalized scaffolds #7

Closed grantdjones closed 7 years ago

grantdjones commented 7 years ago

Hello,

I'm having an issue running crossmap on the human genome, specifically it seems like when it gets to the unlocalized or unplaced scaffolds. I have limited the dataset by not using the "alt" versions of the chromosomes, but these other scaffolds seem to be having issues within the program causing it to exit out without a response. Have you encountered this problem?

Grant

joshuagryphon commented 7 years ago

Hi Grant,

Thank you for the heads up. I haven't seen this before. Exiting without an error message makes me suspect that there's something happening at the system level, like an out-of-memory error or a lack of disk space. If this is what is going on, it's possible to limit both the disk and memory footprint by reducing the number of processes (using -p 1), so that chromosomes are processed sequentially instead of simultaneously. Note, this will increase runtime.

I'd like to try to replicate this on my end. When you have a moment could you tell me:

Thanks & best, Josh

grantdjones commented 7 years ago

Hi Josh,

Thanks so much for getting back to me. Running out of memory may be what’s happening. I’ve run this program on my laptop and it took days and then seemed to just stall while reading one of the chromosomes. Sometimes it would be able to overcome the stall and then move on to the next chromosome. I tried to speed it up by adding more parallel processes through the -p modifier (up to 4) and that worked at first…but then caused it to fail eventually. Likely a memory issue!

Is this program supposed to take multiple days to run? I was confused about this because I thought your paper said a “build cross map” for the human genome should take around 4 hours. We are interested in seeing what the percentage of masking will be from a k-mer size of 15 to 34 nucleotides.

Here’s the info you wanted:

Human genome from UCSC analysis data set without alt chromosomes: hg38.analysisSet.2bit

Command: crossmap -p 8 -k 26 --mismatches 2 --sequence_file /data/jonesgd/hg38_analysis_no_alt.fa --sequence_format fasta --bowtie /usr/local/apps/bowtie/1.1.2/bin/bowtie hg38_analysis_no_alt hg38_analysis

Computer info: Macbook Pro 3.1 GHz Intel Core i7 OSX - Sierra 10.12.2 Python 2.7.10 (native OSX install) Bowtie 1.1.2 RAM- 16GB total, with about 8GB free when i start the program

I’ve also tried this on the biowulf server here at NIH. I ran it with 24GB of memory and a -p 8. It failed sometime overnight. You should be happy to know that one of the bioinformatics people installed plastid as a module that can be called by anyone.

Thanks for any help you can give!

Grant

On Dec 20, 2016, at 11:31 AM, Joshua Griffin Dunn notifications@github.com<mailto:notifications@github.com> wrote:

Hi Grant,

Thank you for the heads up. I haven't seen this before. Exiting without an error message makes me suspect that there's something happening at the system level, like an out-of-memory error or a lack of disk space. If this is what is going on, it's possible to limit both the disk and memory footprint by reducing the number of processes (using -p 1), so that chromosomes are processed sequentially instead of simultaneously. Note, this will increase runtime.

I'd like to try to replicate this on my end. When you have a moment could you tell me:

Thanks & best, Josh

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/joshuagryphon/plastid/issues/7#issuecomment-268289425, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AVQ_N2cu4gx1Ux3O7O6587RFnTZ63yBtks5rKALPgaJpZM4LRMeM.

joshuagryphon commented 7 years ago

Hi Grant,

Thank you for the details. I'll check this out tonight. In the build crossmap test in the paper, we allowed 0 mismatches. 2 mismatches could substantially inflate the runtime, because it radically increases the search space for alignments. On a large genome, like the human genome, it could take a couple of days.

As far as memory goes, I noticed you're using a fasta version of the human genome rather than a 2bit file. Using a 2bit file will also reduce the memory footprint, because we can take advantage of the 2bit file's index to load sequences when we need them, instead of holding them in memory; right now, using a fasta file requires the human genome to be held in memory, parts of it by multiple processes if using -p 2 or more.

I'll re-run on the machine I used in the paper to get an idea of what the timing might be for a 2bit file with 2 mismatches, and will let you know. That should at least give a benchmark for how long it should take under memory-permissive conditions.

So, please stay tuned. And, let me know if anything develops on your end.

Cheers, Josh

joshuagryphon commented 7 years ago

So, running the build crossmap test from the paper, but using --mismatches 2 instead of --mismatches 0 has substantially increased the runtime: in the last through 14 hours, only chrX and chrY have completed, while chr7 is nearly complete. Memory usage is up more manageably, topping out around 5 Gb total, using a 2bit file and two procresses (2.4 Gb / process, using hg38.2bit).

I was curious about how allowing mismatches might affect runtime, so I did some back-of-the-envelope checks. Naively, if you think about alignment as a matching problem, in which N sequencing reads must be mapped against a given genome, if you allow 0 mismatches, alignment of a set of 26-mers requires one string search per read. So, with 0 mismatches there are N searches to perform.

Allowing 2 mismatches, then there are 10400N searches to perform:

(26 positions where first mismatch can occur * 4 nucleotide variants at that position) * (25 positions for second mismatch * 4 variants) * N sequencing reads = 10400N variants = 10400 searches

Fortunately, bowtie is not naive in how it searches, so allowing 2 mismatches for 26-mers shouldn't make alignment take 10400-fold longer. But it does take more time.

As far as memory goes, 2bit is going to be the way to go. If running on a server with 24 Gb of ram, and you want to keep your own footprint below 10 Gb, try 3 or 4 processes:

crossmap -p 3 -k 26 --mismatches 2 \
           --sequence_file /data/jonesgd/hg38.analysisSet.2bit \
           --sequence_format twobit \
           --bowtie /usr/local/apps/bowtie/1.1.2/bin/bowtie hg38_analysis_no_alt \
           hg38_analysis

And be prepared to wait 5 or 6 days.

Hope this helps! Please let me know how things go.

Best, Josh

grantdjones commented 7 years ago

Thanks for all this info! That rate of you’re describing for getting through the chromosomes is exactly what I have encountered. I guess with 2 mismatches, it’s just going to take a while as you said. I’ve definitely switched to the .2bit format for the genome file and that seems to have helped with memory usage. I think to get a first sense of the amount of multimapping, 0 mismatches should work for now. It’s a concern that my PI, Nicholas Guydosh, and I have been discussing and is apparently something reviewers are starting to ask about. This will especially be a concern once we try to align 15-mers.

Looks like everything is working now, thanks for your help. Are you still involved with the Weissman lab or is this just something you do on the side?

Cheers, Grant

On Dec 21, 2016, at 12:25 PM, Joshua Griffin Dunn notifications@github.com<mailto:notifications@github.com> wrote:

So, running the build crossmap test from the paper, but using --mismatches 2 instead of --mismatches 0 has substantially increased the runtime: in the last through 14 hours, only chrX and chrY have completed, while chr7 is nearly complete. Memory usage is up more manageably, topping out around 5 Gb total, using a 2bit file and two procresses (2.4 Gb / process, using hg38.2bit).

I was curious about how allowing mismatches might affect runtime, so I did some back-of-the-envelope checks. Naively, if you think about alignment as a matching problem, in which N sequencing reads must be mapped against a given genome, if you allow 0 mismatches, alignment of a set of 26-mers requires one string search per read. So, with 0 mismatches there are N searches to perform.

Allowing 2 mismatches, then there are 10400N searches to perform:

(26 positions where first mismatch can occur 4 nucleotide variants at that position) (25 positions for second mismatch 4 variants) N sequencing reads = 10400N variants = 10400 searches

Fortunately, bowtie is not naive in how it searches, so allowing 2 mismatches for 26-mers shouldn't make alignment take 10400-fold longer. But it does take more time.

As far as memory goes, 2bit is going to be the way to go. If running on a server with 24 Gb of ram, and you want to keep your own footprint below 10 Gb, try 3 or 4 processes:

crossmap -p 3 -k 26 --mismatches 2 \ --sequence_file /data/jonesgd/hg38.analysisSet.2bit \ --sequence_format twobit \ --bowtie /usr/local/apps/bowtie/1.1.2/bin/bowtie hg38_analysis_no_alt \ hg38_analysis

And be prepared to wait 5 or 6 days.

Hope this helps! Please let me know how things go.

Best, Josh

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/joshuagryphon/plastid/issues/7#issuecomment-268583034, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AVQ_N53snZYGxuOicJVsqyeKbMlpjwQLks5rKWDygaJpZM4LRMeM.

joshuagryphon commented 7 years ago

Hi Grant,

I'm glad to hear things are working! This was a useful discussion for me too. I've updated the crossmap documentation to include notes for running on large genomes, and have added some notes, as well, to one of the tutorials.

The 15-mers will be a challenge, and for those crossmap's approach of whole-genome alignment might be too conservative. But keep me updated, I'd be curious to see what fraction of the genome crossmaps at that length.

These days I am out of the Weissman lab, having (finally) graduated last year. But, I'm still doing a fair amount of sequencing and genomics at my new job at Ginkgo Bioworks. So I plan to keep developing Plastid. That said, I am hoping some other folks will join in at some point, because my time has become a bit more constrained, and there are a lot of additions I'd like to make that have been slow to develop.

If you have any suggestions, please do let me know!

Best, Josh