brownmk commented 8 years ago

To reproduce, use Fasta file:

s1 GGCCGACCTGTCGCTGACGCAGG

Use input file: NNNNNNNNNNNNNNNNNNNNNNN GGCCGACCTGTCGCTGACGCAGG 5

Get a segmentation fault. However, if we use Fasta file, it works.

s1 AGGCCGACCTGTCGCTGACGCAGG

Could you help take a look? Thanks!

pjb7687 commented 8 years ago

Firstly the first line of the input file should be location of fasta file, not a pattern. Please edit your input file and try again.

And the fasta file should start with '>' for each sequence!

brownmk commented 8 years ago

Hi, sorry for the confusion. I actually have the correct input files, I was only providing the info. Here is an example:

cas-offinder> cat t.in bg NNNNNNNNNNNNNNNNNNNNNNN GGCCGACCTGTCGCTGACGCAGG 5 cas-offinder> ls -lt bg total 0 -rwxrwxrwx 1 owner Users 28 Apr 13 09:08 test.fa cas-offinder> cat bg/test.fa

s1 GGCCGACCTGTCGCTGACGCAGG cas-offinder> ./cas-offinder t.in C out Total 1 device(s) found. Loading input file... Reading bg/test.fa... Sending data to devices... Chunk load started. 1 devices selected to analyze... Finding pattern in chunk #1... Comparing patterns in chunk #1...

Segmentation fault

If we prefix "N" to the s1 sequence, it works correctly.

brownmk commented 8 years ago

Some other suggestions: (1) cas-offinder currently can only take a directory name as the first input line, it would be really great if it can use a sequence file name instead. As we keep the hg38.fa file in a folder that contains many other sequence files. If I made a folder and create a symbolic link that points to hg38.fa, cas-offinder does not seem to be able to follow a symbolic link. (2) it will be great if we can specify the number of CPU/GPU to be used. We have a shared GPU cluster, if some GPUs are already in use, it will be great if we can tell cas-finder not to use all GPU devices. Thank you for your kind consideration!

pjb7687 commented 8 years ago

Thank you so much for your bug report and suggestions.

0) I fixed the bug! 1) Now Cas-OFFinder correctly follows symlinks. In addition, if a chromosome file is specified instead of the directory then it will be used for analysis. 2) By specifying an integer number after C/G/A (e.g. cas-offinder input.txt G10 output.txt), user can limit the maximum number of devices for analysis. If user don't specify any number (as previous version), then Cas-OFFinder will try to use maximum number of GPUs installed on system.

You can checkout and pull the latest version of develop branch for the above changes. They will be applied to master branch & released public after some testing.

Best, Jeongbin

brownmk commented 8 years ago

Thank you so much for implementing all these requests so quickly!! There seems to be some typos in the development branch: (1) It does not compile well under Linux read_fasta.cpp: input.open(buf); # buf should be path_buf? read_twobit.cpp: input = fopen(buf, "rb"); # same, buf should be path_buf Also, the follow does not compile. We probably need to define 'FILE input' first, outside the ifdef/else Then change FILE input=... into input=...

ifdef _WIN32

    FILE *input = fopen(filepath.c_str(), "rb");

else

... if (path_cnt >= 0) FILE input = fopen(path_buf, "rb"); else FILE input = fopen(filepath.c_str(), "rb"); (2) I made these changes and tested it, it does not seem to take a number after "G" ./cas_dev t.in G4 ooo clCreateContext Failed: -33

(3) Since you are so open to suggestions, I have one last one. I try to use cas-offinder to implement what Broad group did in their 2015 paper. Since they allow up to 5 mismatches, cas-offinder will generate tons of output (for the many sgRNAs, I will be looking at GBs' output). So I plan to modify cas-offinder, so that it can dump output to stdout, then pipe the stdout to Broad's cfd-score-calculation and only keep the lines that has significant off-target scores. E.g., I will run: cas-offinder input G - | cdf-score-calculation > cas_out.txt So when output file name is "-", we can write to stdout, I guess you probably can support that. Now the question is you have other message lines also goes to stdout (instead of stderr), it is hard for cdf-score-calculation to figure out which line contains message, which contains actual data. I plan to modify cas-finder to prefix all data line with a character "@", if output file is "-", you might have a more creative and cleaner solution to this use case.

Thanks again for your effort in providing such a high-performance piece of program!

pjb7687 commented 8 years ago

1) fixed! 2) It works in my environment. Anway -33 is CL_INVALID_DEVICE, it occurs when the device is invalid or not accessible. Would you try other numbers again? 3) Nice idea. I think it is easy, just I would put the message lines towards the stderr. I will implement it soon.

pjb7687 commented 8 years ago

Done. Would you test it again?

brownmk commented 8 years ago

Thanks! It works great. "-" triggers output to stdout! I am not familiar with cuda, but I cannot see to get Gx to work. As shown below, this computer has 6 GPUs, I tried number 1 to 6, only 6 works.

$ ./cas_dev t.in G out Total 6 device(s) found. Loading input file... Reading all_exonome.fa... Sending data to devices... Chunk load started. 6 devices selected to analyze... Finding pattern in chunk #1... Comparing patterns in chunk #1... 6.4111 seconds elapsed. $ ./cas_dev t.in G1 out Segmentation fault $ ./cas_dev t.in G2 out clCreateContext Failed: -33 $ ./cas_dev t.in G3 out Segmentation fault $ ./cas_dev t.in G4 out clCreateContext Failed: -33 $ ./cas_dev t.in G5 out Segmentation fault $ ./cas_dev t.in G6 out Total 6 device(s) found. Loading input file... Reading all_exonome.fa... Sending data to devices... Chunk load started. 6 devices selected to analyze... Finding pattern in chunk #1... Comparing patterns in chunk #1... 6.19232 seconds elapsed.

pjb7687 commented 8 years ago

I fixed it, there was an error while counting the number of available devices for each platform.

brownmk commented 8 years ago

Perfect, works as expected! (The only minor thing is you might want to update the usage text to show the new features)

Also, just an interesting thing (I test to query 6 sequences against human exonome) Performance: G6: 13.7023 seconds elapsed. G5: 9.86336 seconds elapsed. G4: 11.3326 seconds elapsed. G3: 7.96847 seconds elapsed. G2: 9.24136 seconds elapsed. G1: 3.79438 seconds elapsed.

Thanks again for making all the improvements so quickly, this tools is perfect now!

$ ./cas_dev t.in G6 - > /dev/null Total 6 device(s) found. Loading input file... Reading all_exonome.fa... Sending data to devices... Chunk load started. 6 devices selected to analyze... Finding pattern in chunk #1... Comparing patterns in chunk #1... 13.7023 seconds elapsed.

brownmk commented 8 years ago

Maybe you have a bug, G1 actually means using all GPUs and G6 means use only one? The performance data above seems odd.

pjb7687 commented 8 years ago

I think because I/O latency dominates when Cas-OFFinder analyze such a small number of targets, I expect different result when the number of targets becomes so large. Well, I will test it anyway.

brownmk commented 8 years ago

Hi, I have trouble with GRID K520. I try to run the GPU job on Amazon GPU instance, now, which uses GRID K520. The code has segmentation fault during initOpenCL. My command line is : cas-offinder t.in G -

I added a few cerr line into initOpenCL()

    m_devnum = 0;
    cerr << "platform_cnt=" << platform_cnt << endl;
    for (i = 0; i < platform_cnt; i++) {
            platform_maxdevnum = maxdevnum - m_devnum;
            cerr << "i=" << i << " maxdevnum=" << maxdevnum << " m_devnumn=" << m_devnum << endl;
            if (platform_maxdevnum > 0) {
                    oclGetDeviceIDs(platforms[i], m_devtype, platform_maxdevnum, devices + m_devnum, &device_cnt);
                    m_devnum += MIN(device_cnt, platform_maxdevnum);
            }
            cerr << "Processed: " << i << "  m_devnum=" << m_devnum << endl;
    }

Here is the output (there are only 2 GPUs, however, m_devnum is 10 intead of 1 after process i=0) The old versions seems to run, so the bug is related to the new logic introduced. Thanks!

[ec2-user@ip-172-31-14-64 cas-offinder-develop_new]$ ./cas-offinder ../t.in G - platform_cnt=2 i=0 maxdevnum=1000 m_devnumn=0 Processed: 0 m_devnum=10 i=1 maxdevnum=1000 m_devnumn=10 Processed: 1 m_devnum=11 Segmentation fault

pjb7687 commented 8 years ago

Hi, would you add the same cerr codes to the old version of Cas-OFFinder? I think there should be no difference, so weird.

brownmk commented 8 years ago

With old version: [ec2-user@ip-172-31-14-64 cas-offinder-master]$ ./cas-offinder ... Available device list: Type: CPU, ' Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz' Type: GPU, ' Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz' Type: ACCELERATOR, ' Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz' Type: GPU, 'GRID K520'

I added the following debugging lines: m_devnum = 0; cerr << "A. platform_cnt=" << platform_cnt << " m_devtype=" << m_devtype << " MAX_DEVICE_NUM=" << MAX_DEVICE_NUM << " device_cnt=" << device_cnt << " m_devnum=" << m_devnum << " devices=" << devices << endl; for (i = 0; i < platform_cnt; i++) { cerr << "i=" << i << endl; oclGetDeviceIDs(platforms[i], m_devtype, MAX_DEVICE_NUM - m_devnum, devices + m_devnum, &device_cnt); cerr << "B. platform_cnt=" << platform_cnt << " m_devtype=" << m_devtype << " MAX_DEVICE_NUM=" << MAX_DEVICE_NUM << " device_cnt=" << device_cnt << " m_devnum=" << m_devnum << " devices=" << devices << endl; m_devnum += device_cnt; } cerr << "C. platform_cnt=" << platform_cnt << " m_devtype=" << m_devtype << " MAX_DEVICE_NUM=" << MAX_DEVICE_NUM << " device_cnt=" << device_cnt << " m_devnum=" << m_devnum << " devices=" << devices << endl; Output: [ec2-user@ip-172-31-14-64 cas-offinder-master]$ ./cas-offinder test.in G1 t A. platform_cnt=2 m_devtype=4 MAX_DEVICE_NUM=1000 device_cnt=0 m_devnum=0 devices=0x7ffd30b669a0 i=0 B. platform_cnt=2 m_devtype=4 MAX_DEVICE_NUM=1000 device_cnt=0 m_devnum=0 devices=0x7ffd30b669a0 i=1 B. platform_cnt=2 m_devtype=4 MAX_DEVICE_NUM=1000 device_cnt=1 m_devnum=0 devices=0x7ffd30b669a0 C. platform_cnt=2 m_devtype=4 MAX_DEVICE_NUM=1000 device_cnt=1 m_devnum=1 devices=0x7ffd30b669a0 Total 1 device(s) found.

With the new version: m_devnum = 0; cerr << "A. platform_cnt=" << platform_cnt << " maxdevnum=" << maxdevnum << " m_devtype=" << m_devtype << " device_cnt=" << device_cnt << " m_devnum=" << m_devnum << " devices=" << devices << endl; for (i = 0; i < maxdevnum; i++) { cerr << "i=" << i << endl; oclGetDeviceIDs(platforms[i], m_devtype, maxdevnum - m_devnum, devices + m_devnum, &device_cnt); cerr << "B. platform_cnt=" << platform_cnt << " maxdevnum=" << maxdevnum << " m_devtype=" << m_devtype << " device_cnt=" << device_cnt << " m_devnum=" << m_devnum << " devices=" << devices << endl; m_devnum += device_cnt; } cerr << "C. platform_cnt=" << platform_cnt << " maxdevnum=" << maxdevnum << " m_devtype=" << m_devtype << " device_cnt=" << device_cnt << " m_devnum=" << m_devnum << " devices=" << devices << endl;

[ec2-user@ip-172-31-14-64 cas-offinder-develop_new]$ ./cas-offinder ../t.in G1 - A. platform_cnt=2 maxdevnum=1 m_devtype=4 device_cnt=10 m_devnum=0 devices=0x12370a0 i=0 B. platform_cnt=2 maxdevnum=1 m_devtype=4 device_cnt=10 m_devnum=0 devices=0x12370a0 C. platform_cnt=2 maxdevnum=1 m_devtype=4 device_cnt=10 m_devnum=10 devices=0x12370a0 Segmentation fault [ec2-user@ip-172-31-14-64 cas-offinder-develop_new]$ ./cas-offinder ../t.in G t A. platform_cnt=2 maxdevnum=1000 m_devtype=4 device_cnt=10 m_devnum=0 devices=0x877030 i=0 B. platform_cnt=2 maxdevnum=1000 m_devtype=4 device_cnt=10 m_devnum=0 devices=0x877030 i=1 B. platform_cnt=2 maxdevnum=1000 m_devtype=4 device_cnt=1 m_devnum=10 devices=0x877030 i=2 clGetDeviceIDs Failed: -32

pjb7687 commented 8 years ago

It looks like device_cnt should be initialized to zero before running clGetDeviceIDs. Would you try the latest commit again? Maybe it also should be applied to master, obviously this is a potential bug.

brownmk commented 8 years ago

Hi, I rebooted my Amazon instance and just tested it. Somehow the error is gone. I am no longer able to reproduce it, so I can only assume it my own fault somehow. Thank you so much for your patience, I really appreciate all your helps!

brownmk commented 8 years ago

I am now testing with 100 gRNA as input, the background sequence is at 10% of the genome, so there is decent amount of computation. Run time statistics does seem odd (test with a server with 6 GPU devices, cas-offinder detected six devices correctly).

Running 1 G6 job on 100 sgRNA takes 332.941 sec. This is very strange compared to numbers below. Running 2 G3 jobs on 100 sgRNA each takes ~35 sec. each Running 3 G2 jobs on 100 sgRNA each takes ~ 1190 sec. each Running 6 G1 jobs on 100 sgRNA each (2-4 jobs will crash with: clEnqueueNDRangeKernel Failed: -4)

!/bin/sh

date > t0.start; ./cas-offinder t.in G1 t0.ou & date > t1.start; ./cas-offinder t.in G1 t1.ou & date > t2.start; ./cas-offinder t.in G1 t2.ou & date > t3.start; ./cas-offinder t.in G1 t3.ou & date > t4.start; ./cas-offinder t.in G1 t4.ou & date > t5.start; ./cas-offinder t.in G1 t5.ou & The remaining good runs takes about ~145 sec each.

Looks like there is huge benefit of running jobs as G1, if we can solve the clEnqueueNDRangeKernel issue.

pjb7687 commented 8 years ago

1)

I can only assume it my own fault somehow.

No I don't think so. This is obviously a bug in Cas-OFFinder. The 'device_cnt' variable must be initialized first, according to the spec of OpenCL. It might work sometimes without proper initialization, but it is not a complete solution.

2)

The speed of computation depends on many things. Memory size, vectorization, interface speed of PCIe slots, driver, and so on. In other words, computation time does not always decrease along the number of devices. That means it is so reasonable to include a feature for user to limit the number of computing devices in the future version of Cas-OFFinder.

3)

Because in this case all Cas-OFFinder processes try to use the first device simultaneously, then clEnqueueNDRangeKernel must fail (usually I don't recommend you to run Cas-OFFinder on one device more than 2 instances). Hmm, maybe it would be better to change the meaning of number after 'G', e.g. maximum number of GPU -> index of GPU?

brownmk commented 8 years ago

Hmm, maybe it would be better to change the meaning of number after 'G', e.g. maximum number of GPU -> index of GPU? That would be good, I run more testing, I can tell it makes a lot more sense to run multiple G1 jobs than a job using multiple devices. Do you consider a syntax like G4.0.1.2.4? G4 stands for using 4 devices and don't care which four. G4.0.1.2.4 means using 4 devices with ID 0, 1, 2, 4? Then we can submit G1.0, G1.1, G1.2 ...

pjb7687 commented 8 years ago

Yes, but I would like to suggest below syntax for specifying devices using : and ,.

examples) G1:3 (GPU 1 to 3) G2,4 (GPU 2 and 4) C1 (CPU 1)

rules)

Device ID starts with 1. Dry-run of Cas-OFFinder should print device ID for each type of device.
Both bounds are inclusive for :.
Specifying total number of GPU is not necessary anymore.

What do you think?

brownmk commented 8 years ago

Looks good to me. I assume others won't have a case where they need 1 GPU, but don't know which one is available. In our GPU cluster, we can get the GPU ID of the free devices, so this is not a problem for us.

pjb7687 commented 8 years ago

Well, maybe when one needs to run several processes simultaneously, I think it would be useful. But I am not pretty sure how many people have such a big cluster with lots of GPUs installed in..

Anyway would you mind if I ask the name of GPU cluster that you use? Is it commercially available? It looks pretty nice.

snugel / cas-offinder

Segemetation fault when match is at the very first location #5

Segmentation fault

ifdef _WIN32

else

Thanks again for making all the improvements so quickly, this tools is perfect now!

I added a few cerr line into initOpenCL()

!/bin/sh