maximilianh / crisporWebsite

All source code of the crispor.org website
http://crispor.org

Error when putting the genome onto the ramdisk #40

Closed Lukas-1 closed 1 year ago

Lukas-1 commented 4 years ago

Hello, and many thanks for writing CRISPOR!!

I have been able to locally install CRISPOR on an Ubuntu 18.04 OS, in order to run it from the command line. I downloaded the hg38 genome from http://crispor.tefor.net/genomes/hg38anset/, and everything works just fine. The off-target and on-target scores are identical to those from the CRISPOR website.

However, it is rather slow. I tried to follow your advice here on GitHub to put the genome on the RAMdisk. However, I keep encountering the same error:

`
INFO:root:Using bedtools and genome fasta on ramdisk, /dev/shm/hg38anset.fa
index file /dev/shm/hg38anset.fa.fai not found, generating...
Traceback (most recent call last):
  File "bin/filterFaToBed", line 182, in <module>
    main()
  File "bin/filterFaToBed", line 151, in main
    if bool(int(isRep)):
ValueError: invalid literal for int() with base 10: '0::chr1:18216-18239(-)'
real 0m2.599s
user 0m1.981s
sys 0m0.565s
ERROR:root:Error: could not run command set -o pipefail; time bedtools getfasta -s -name -fi /dev/shm/hg38anset.fa -bed /tmp/crisporjom1oC/KU3v9oQSzWkIKX81pqP4.matches.bed -fo /dev/stdout | bin/filterFaToBed /tmp/crisporjom1oC/KU3v9oQSzWkIKX81pqP4.fa NGG NAG,NGA 1.0 > /tmp/crisporjom1oC/KU3v9oQSzWkIKX81pqP4.filtMatches.bed.
`

I get this error regardless of whether I use FASTA files or .bed files as the input. Do you have any idea what might be causing this issue?

And do you have any other tips for speeding up the scoring of many thousands of guides (with a .bed file as the input)?

maximilianh commented 4 years ago

Hi Lukas,

slowness: can you tell me a little how exactly you're running it? How much RAM does your machine have? How big is your input file?

The error: hm, this is bad... does this only happen when you use the ramdisk? So it worked just fine without the ramdisk, but now it's throwing the error?

maximilianh commented 4 years ago

Also, depending on what you're doing flashfry may be the better tool for your application. If you're scoring many many guides, it was designed for that.

Lukas-1 commented 4 years ago

Yes, exactly. It works fine without the RAMdisk, but as soon as I copy the genome onto the RAMdisk, it gives me this error: "invalid literal for int() with base 10: '0::chr1:18216-18239(-)'" or similar, in "filterFaToBed".

I am running CRISPOR on an Ubuntu 18.04 virtual machine using VirtualBox, and have allotted it 24 GB of RAM.

The input file is a .bed file 2154 KB in size, containing the locations of 14177 guides (a CRISPR sub-library). I want to calculate Doench efficacy scores and CFD scores for all off-targets.

If FlashFry finds exactly the same off-targets, it might be a good option. (I have noticed that CRISPOR returns different, often more, off-targets than GuideScan, so I expect some minor differences between CRISPOR and FlashFry. However, I would prefer to use CRISPOR, since the web platform is commonly used and I consider it the "gold standard" ;-), and my local installation gives identical results to the website.)

maximilianh commented 4 years ago

It may take a while until I've fixed this, especially with the holidays. Can you run crispor for now without the ramdisk speedup? Wouldn't that be fast enough for you? It should be a lot faster than waiting for me. I'm really surprised by this bug, as we've used this with bedtools many times before. I wonder if it has to do with the version of bedtools.
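For anyone hitting the same traceback: the value that `int()` rejects, `'0::chr1:18216-18239(-)'`, looks like the expected 0/1 flag followed by a `::chrom:start-end(strand)` suffix. As far as I know, newer bedtools releases append coordinates in this form to the sequence name when `getfasta -name` is used, which would fit the version suspicion, but nothing in this thread confirms that this is the cause. Below is a minimal sketch of a tolerant parse, assuming the flag is always the part before the first `::`. The helper name `parse_is_rep` is hypothetical and is not part of CRISPOR.

```python
def parse_is_rep(field):
    """Return the repeat flag as a bool from a FASTA header field.

    Accepts a bare flag ("0" or "1") as well as a flag followed by a
    "::chrom:start-end(strand)" suffix, as seen in the error message.
    Assumption: the flag is everything before the first "::".
    """
    flag = field.split("::", 1)[0]
    return bool(int(flag))


print(parse_is_rep("0::chr1:18216-18239(-)"))
print(parse_is_rep("1"))
```

This only illustrates how the failing value could be handled; whether patching the parser or changing the bedtools call is the right fix is up to the maintainer.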


maximilianh commented 4 years ago

Even if it takes a few seconds for every guide, your BED file should still be processed after a night or so, so I guess it's easiest for you, for now, not to use the ramdisk?
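As a rough check of the "a night or so" estimate, here is a small sketch that multiplies the 14177 guides mentioned above by a few assumed per-guide times. The values of 2, 3 and 4 seconds are assumptions standing in for "a few seconds"; they were not measured.

```python
def estimated_hours(n_guides, seconds_per_guide):
    """Total runtime in hours for a serial run over all guides."""
    return n_guides * seconds_per_guide / 3600.0


# 14177 guides is the size of the BED file described in this thread.
for secs in (2, 3, 4):
    print("%d s/guide: %.1f h" % (secs, estimated_hours(14177, secs)))
```

Under these assumptions a single serial run lands at roughly 8 to 16 hours.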


Lukas-1 commented 4 years ago

Yes, it certainly works for now. If I want to annotate whole-genome libraries, I may have to parallelize it by breaking the .bed files into chunks and running CRISPOR on separate virtual machines. :)
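The chunking idea could be sketched roughly like this. It is a minimal example that assumes the BED file has one guide per line and no header lines; the function name `split_bed` and the `chunk_NNN.bed` naming are made up for illustration. Each resulting file would then be passed to a separate CRISPOR run.

```python
import os


def split_bed(bed_path, out_dir, n_chunks):
    """Split a BED file into at most n_chunks files of near-equal size.

    Returns the list of chunk file paths that were written.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    with open(bed_path) as fh:
        # Skip blank lines so they do not count as guides.
        lines = [ln for ln in fh if ln.strip()]
    os.makedirs(out_dir, exist_ok=True)
    chunk_size = -(-len(lines) // n_chunks)  # ceiling division
    paths = []
    for i in range(n_chunks):
        chunk = lines[i * chunk_size:(i + 1) * chunk_size]
        if not chunk:
            break  # fewer lines than chunks requested
        path = os.path.join(out_dir, "chunk_%03d.bed" % i)
        with open(path, "w") as out:
            out.writelines(chunk)
        paths.append(path)
    return paths
```

Splitting by line keeps every guide intact, and the per-chunk outputs can be concatenated afterwards, assuming each guide is scored independently of the others in the file.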

I tried uninstalling bedtools (I had been using version 2.26.2) and using the newest version of bedtools (2.29.2) I found here, but I got the same error.

maximilianh commented 1 year ago

I imagine you got your problem solved. Closing this ticket now.