mourisl / T1K

T1K is a versatile methods to genotype highly polymorphic genes (e.g. KIR, HLA) with bulk or single-cell RNA-seq, WGS or WES data.
MIT License
42 stars 6 forks source link

question about post analysis #27

Closed ijuric closed 3 months ago

ijuric commented 3 months ago

Hello, First, thank you for developing this method and building easy to use code. I've ran the code to hla-type 10X scrna-seq samples and so far it's working great at genotyping my individuals. However, I need to use the program with -- skipPostAnalysis tag, since otherwise the analysis runs for several days, at which point my jobs on the cluster are killed.

I'm working with cancer samples. In some cancers, a fraction of cancer cells in the same individual might express only one hla allele, while other cells might express both (and sometimes cells can express neither). In the long run, I'd like to be able to distinguish those cell groups. First step would be to hla-type each cell. As far as I understand, the part of the post analysis that is the 'single-cell representation' can give per-cell allelic abundance estimate. Is that correct?

If so, is there a way to run post analysis in parallel?

For example, (and I'm totally making this up because I don't know how it's ran), if post analysis is running T1K for each cell independently, then I can split _aligned_1.fa _aligned_2.fa by barcodes stored in _aligned_bc.fa and run multiple jobs at the same time, one per barcode with --stage 2 tag. Would that be valid? (and yes, if I run it like that, then I'd end up with bunch of output files which I'd need to deal with, but that's okay as long as the procedure is valid)

Thank you, Ivan

mourisl commented 3 months ago

Thank you for using T1K. The most time-consuming step is inferring the novel variants, which can take a long time in some cases. I will add some parameters to control the time spent on this step so you can use the post-analysis.

mourisl commented 3 months ago

I have updated the constraint as an option in the github repo. Could you please pull the github repo and give the new postanalysis stage a try?

ijuric commented 3 months ago

Thank you! Running it now with default --post-varMaxGroup parameter. I'll let you know how long it took to finish.

ijuric commented 3 months ago

Hello, A quick update. When running t1k with --post-varMaxGroup, analyzer part fails after finishing genotyping:

[Tue Apr  9 10:01:32 2024] SYSTEM CALL: /mnt/beegfs/projects/human/tools/T1K/analyzer  --post-varMaxGroup 1 --barcode P0base.v4_aligned_bc.fa -o P0base.v4 -t 24 -f /mnt/beegfs/projects/human/tools/T1K/hlaidx/hlaidx_rna_seq.fa -a P0base.v4_allele.tsv -1 P0base.v4_aligned_1.fa -2 P0base.v4_aligned_2.fa
/mnt/beegfs/projects/human/tools/T1K/analyzer: unrecognized option '--post-varMaxGroup'
./analyzer [OPTIONS]:
Required:
        -f STRING: fasta file containing the reference genome sequence
        -a STRING: selected alleles list file
        [Read file]
        -u STRING: path to single-end read file
        -1 STRING -2 STRING: path to paired-end files
Optional:
        -t INT: number of threads (default: 1)
        -o STRING: output prefix (defult: t1k)
        -n INT: maximal number of alleles per read (default: 2000)
        -s FLOAT: filter alignments with alignment similarity less than specified value (defalut: 0.8)
        --barcode STRING: path to the barcode file
        --relaxIntronAlign: allow one more mismatch in intronic alignment (default: false)
        --alleleDigitUnits INT: the number of units in genotyping result (default: automatic)
        --alleleDelimiter CHR: the delimiter character for digit unit (default: automatic)
        --varMaxGroup INT: the maximum variant group size to call novel variant. -1 for no limitation (default: 8)
system /mnt/beegfs/projects/human/tools/T1K/analyzer  --post-varMaxGroup 1 --barcode P0base.v4_aligned_bc.fa -o P0base.v4 -t 24 -f /mnt/beegfs/projects/human/tools/T1K/hlaidx/hlaidx_rna_seq.fa -a P0base.v4_allele.tsv -1 P0base.v4_aligned_1.fa -2 P0base.v4_aligned_2.fa failed: 256 at 
/mnt/beegfs/projects/human/tools/T1K/run-t1k line 61.

Probably this is because analyzer uses varMaxGroup rather than post-varMaxGroup (line 47, Analyzer.cpp).
 Changing --post-varMaxGroup to --varMaxGroup in run-t1k (line 101, also 55) and then running it with --varMaxGroup instead of --post-varMaxGroup seems like it fixes the issue. It’s been running for 22 hours with --varMaxGroup 1 now.

mourisl commented 3 months ago

Thank you for finding this issue. I have fixed this issue in the wrapper to the github. I guess the slowness may not comes from variant calling now, or this parameter did not resolve the bottleneck. I'll look into this a bit more.

For this 22-hours run, is it 22 hours after the analyzer step?

ijuric commented 3 months ago

Thank you for fixing the issue! Yes. Everything else is done within ~40 min. Here are the last couple lines of the log file of the current run that's still running:

[Tue Apr  9 12:26:04 2024] SYSTEM CALL: /mnt/beegfs/projects/human/tools/T1K/analyzer  --varMaxGroup 2 --barcode P0base.v4_aligned_bc.fa -o P0base.v4 -t 24 -f /mnt/beegfs/projects/human/tools/T1K/hlaidx/hlaidx_rna_seq.fa -a P0base.v4_allele.tsv -1 P0base.v4_aligned_1.fa -2 P0base.v4_aligned_2.fa
[Tue Apr  9 12:26:07 2024] Found 1975284 read fragments. Start read assignment.
[Tue Apr  9 12:26:10 2024] Finish read end assignments.
[Tue Apr  9 12:26:11 2024] Finish read fragment assignments. 1960892 read fragments can be assigned (average 1.63 alleles/read).
[Tue Apr  9 12:26:11 2024] Finish allele quantification in 11 EM iterations.

Previously (before the introduction of --post-varMaxGroup), for this particular dataset, I managed to finish the run on the same cluster node I'm running it on now. This is what the ending of that log file looked like:

[Fri Apr  5 20:37:30 2024] Found 1975284 read fragments. Start read assignment.
[Fri Apr  5 20:37:33 2024] Finish read end assignments.
[Fri Apr  5 20:37:34 2024] Finish read fragment assignments. 1960892 read fragments can be assigned (average 1.63 alleles/read).
[Fri Apr  5 20:37:34 2024] Finish allele quantification in 11 EM iterations.
[Sun Apr  7 10:14:30 2024] Post analysis finishes.
[Sun Apr  7 10:14:30 2024] Finish.

Once current run finishes, I can let you know how how much faster it was.

mourisl commented 3 months ago

I just pushed an update to github, so you can use "--varMaxGroup 0" to completely skip the variant calling step. Is it possible to share the XX_aligned_1/2.fa, and the XXX_allele.tsv files? I'll check the reason of the slowness. Thank you.

ijuric commented 3 months ago

Yes on .tsv file (would sending it via email be okay?), but I need to check and get back to you about .fa files. Just to confirm, these files only contain only reads that overlap hla loci, correct?

mourisl commented 3 months ago

Email is fantastic. My email is Li.Song@dartmouth.edu .

Just to confirm, these files only contain only reads that overlap hla loci, correct?

Right.

mourisl commented 3 months ago

Thank you for sharing the file! It actually finishes quite fast on our server. When I check your log again, I just notice that you don't have the "-s 0.97" in the command. Did you run T1K with the option "--preset HLA"?

ijuric commented 3 months ago

That was it! I didn't have that part in my call. After adding it, runs finished quickly. Thank you very much for figuring it!

mourisl commented 3 months ago

I'll close this issue for now. If you have found any other issues, please feel free to reopen this one or create another thread.