splicebox / PsiCLASS

Simultaneous multi-sample transcript assembler for RNA-seq data

Experiencing very slow run-time #31

Open tcashby opened 10 months ago

tcashby commented 10 months ago

The paper describing PsiCLASS indicated that the software should run within 1-5 minutes per sample. However, I'm experiencing a much longer runtime (several hours) when processing human plasma cell data aligned to hg38 using HISAT2. It has been running for 15 hours as of this post and still hasn't finished the raw_splice portion.

The code used was:

psiclass --lb bam_list.txt -p 24

As an initial test, bam_list.txt only contains five bam files. The sequencing data is from human plasma cells aligned to hg38 with HISAT2. It appears that the software runs quickly until it encounters the IGK region on chromosome 2. I suspect that regions of high complexity, such as the Immunoglobulin regions in plasma cells, might be causing the extended runtime.

Questions:

  1. Is there a known issue of slowdown around complex regions?
  2. Are there any settings, optimizations, or preprocessing steps I can take to improve the runtime?

mourisl commented 10 months ago

  1. Yes, the program slows down in complex regions. The VJ gene joining regions probably create a lot of false exon boundaries close to each other, which confuses PsiCLASS when it tries to select the best path.
  2. You can use a smaller value for the option "--maxDpConstraintSize", which exists mainly for those complex regions. In my tests, the accuracy was about the same even with values as small as 5 (see the example command below).
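
The original command could, for instance, be re-run with the smaller constraint size roughly like this (assuming the flag is simply added alongside the existing options; please double-check the exact spelling against the program's usage message):

psiclass --lb bam_list.txt -p 24 --maxDpConstraintSize 5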

Since you mention the IGK region and plasma cells: if you are interested in their BCR sequences, you can try another tool, TRUST4 (https://github.com/liulab-dfci/TRUST4), which I developed during my postdoc. TRUST4 is designed to assemble TCRs and BCRs from RNA-seq data.

Hope these help.

tcashby commented 10 months ago

At the moment, I'm more interested in novel transcript discovery and less concerned with the VJ regions, although I will certainly look into TRUST4 when I get to that point.

My current idea (if your maxDpConstraintSize suggestion doesn't fix it) is to just edit the code to ignore those regions. It's not practical in my case to make versions of the bam files without them; that would be too time- and disk-space-intensive. Do you see any issue with adding a blacklist parameter that takes a bed file and ignores the regions specified in it?
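
To make the idea concrete, here is a rough sketch of what I'm picturing (purely illustrative; the function names, the global container, and the assumption of a plain 3-column BED file are mine, not anything that exists in PsiCLASS):

// Illustrative sketch of a possible --blacklist option.
// Assumes a standard 3-column BED file (chrom, 0-based start, end).
#include <algorithm>
#include <climits>
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// chrom -> sorted, non-overlapping (start, end) intervals, BED half-open coordinates
static std::map<std::string, std::vector<std::pair<int, int>>> blacklist;

// Load the BED file once at start-up.
void LoadBlacklist(const std::string &bedPath)
{
    std::ifstream in(bedPath);
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        std::string chrom;
        int start, end;
        if (fields >> chrom >> start >> end)
            blacklist[chrom].push_back(std::make_pair(start, end));
    }
    for (auto &it : blacklist)
        std::sort(it.second.begin(), it.second.end());
}

// Return true if a 0-based position falls inside a blacklisted interval.
bool InBlacklist(const std::string &chrom, int pos)
{
    auto it = blacklist.find(chrom);
    if (it == blacklist.end())
        return false;
    const auto &ivals = it->second;
    // Last interval starting at or before pos may cover it.
    auto up = std::upper_bound(ivals.begin(), ivals.end(),
                               std::make_pair(pos, INT_MAX));
    if (up == ivals.begin())
        return false;
    --up;
    return pos < up->second;
}

The check could then sit in the read loop in FindJunction.cpp, e.g. something along the lines of "if (InBlacklist(col[2], b->core.pos)) continue;" right after the chromosome name is copied.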

mourisl commented 10 months ago

The blacklist parameter is a great suggestion. I think I can implement it in a few days. (Hopefully maxDpConstraintSize resolves your issue so you don't need to wait for me.)

Another manual workaround: from PsiCLASS's log you can find the range of the complex region, and then remove the subexons in that region from the subexon_combined.out file. You should use the whole gene region reported in the log; otherwise the gene structure might be disrupted at the blacklist boundary and crash the program.

tcashby commented 10 months ago

Looking at it more, I think the issue is upstream of the maxDpConstraintSize parameter.

I haven't had time to fully decipher everything that FindJunction.cpp is doing, but it looks like it took about 18 hours to generate the "raw_splice" files.

After the first run, I added some rudimentary code to output the starting position of each read so I could tell which positions it was getting 'stuck' on. I then re-ran the program; it ran quickly until it reached chr2 around the centromeric region. When I looked in detail at one of the positions (chr2:88966543, in the IGK region), over 300,000 reads started at that position.

mourisl commented 10 months ago

Sorry, I missed that point in your first question. FindJunction.cpp is usually very fast, so it slipped my mind. I think that since there are so many reads in the VDJ region, each potentially creating slightly different intron junctions, the queue I use to store the junction information might overflow and cause an infinite loop or other issues. Could you please change "#define QUEUE_SIZE 10001" to a larger number, e.g. 1000000, in FindJunction.cpp, remake PsiCLASS, and give it a try?
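
Concretely, the change is just the QUEUE_SIZE macro in FindJunction.cpp, followed by a rebuild (1000000 is only a suggested test value):

// FindJunction.cpp
// before:
#define QUEUE_SIZE 10001
// suggested test value:
#define QUEUE_SIZE 1000000

and then re-running make to rebuild PsiCLASS.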

tcashby commented 10 months ago

I'll give your change a shot and let you know.

In the meantime, I've been playing around with the code and may have found a workaround. I'm not sure what unintended effects it will have downstream or what other bits of code I would need to modify; I doubt the problem is limited to FindJunction.cpp.

Anyway, I modified the code as follows (starting on line 886 of the original):

int previous_pos = -1; // add variable to track current position
int read_count_at_pos = 0; // add counter for number of reads seen at current position
while (1)
{
    int flag = 0;
    if (useSam)
    {
        if (b)
            bam_destroy1(b);
        b = bam_init1();
        if (samread(fpsam, b) <= 0)
            break;
        if (b->core.tid >= 0)
            strcpy(col[2], fpsam->header->target_name[b->core.tid]);
        else
            continue;

        if((b->core.pos + 1) == previous_pos)
        {
            read_count_at_pos++;
            if(read_count_at_pos > 100) //if we've already got 100 reads at this position, move on...
            {
                continue;
            }
        }
        else
        {
            previous_pos = b->core.pos + 1;
            read_count_at_pos = 1;
        }

The idea is that if we hit a position with hundreds of reads starting there, we keep the first 100 and ignore the rest.

mourisl commented 10 months ago

I think this strategy might lose real splice sites in other highly expressed genes. When expression is high, reads are very likely to start at the same position thousands of times, so the stop-early strategy may miss alternative 3' acceptor splicing events. The slowness mainly comes from the reads in the region creating too many splice sites.

tcashby commented 10 months ago

After updating QUEUE_SIZE, it took about five hours per bam file to generate the raw_splice files. A blacklist would still be nice, but I don't know how you would implement it. If you could utilize the bam index it would be relatively straightforward, but with the older version of "sam.h" it might be a pretty hefty rewrite of the code.
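
For reference, this is roughly what the index-based approach looks like with modern htslib (illustrative only; the region string and names are made up, error checks are omitted, and the bundled sam.h doesn't expose this interface, which is why I expect a hefty rewrite):

#include <htslib/sam.h>

// Use the BAM index to visit only a chosen region instead of streaming
// the whole file. Skipping a blacklist would mean querying the complement
// of the blacklisted intervals.
void VisitRegion(const char *bamPath, const char *region)
{
    samFile *fp = sam_open(bamPath, "r");
    bam_hdr_t *hdr = sam_hdr_read(fp);
    hts_idx_t *idx = sam_index_load(fp, bamPath);
    hts_itr_t *itr = sam_itr_querys(idx, hdr, region); // e.g. "chr2:1-88000000"
    bam1_t *b = bam_init1();
    while (sam_itr_next(fp, itr, b) >= 0)
    {
        // process the alignment exactly as the existing loop does
    }
    bam_destroy1(b);
    hts_itr_destroy(itr);
    hts_idx_destroy(idx);
    bam_hdr_destroy(hdr);
    sam_close(fp);
}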

mourisl commented 10 months ago

I think the current implementation scans most elements in the queue in linear time, which could become the bottleneck in these messy regions. I'm currently refactoring the code and hope to reduce the lookup to O(log n). This should resolve the speed issue.
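
(For the curious, the general idea is to key the junctions by their coordinates in an ordered container instead of scanning a flat queue. The sketch below is generic, not the actual PsiCLASS data structure; it just illustrates where the O(log n) comes from.)

#include <cstdint>
#include <map>
#include <utility>

// Generic sketch: count support for each (intron start, intron end) junction.
// std::map gives O(log n) insert/lookup instead of a linear scan over a queue.
struct JunctionInfo
{
    int support = 0;     // number of reads spanning this junction
    int uniqSupport = 0; // uniquely mapped reads only
};

std::map<std::pair<int64_t, int64_t>, JunctionInfo> junctions;

// Called once per spliced alignment segment.
void AddJunction(int64_t start, int64_t end, bool uniqueMapping)
{
    JunctionInfo &info = junctions[{start, end}]; // O(log n)
    ++info.support;
    if (uniqueMapping)
        ++info.uniqSupport;
}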