ohlerlab / PARpipe

Complete analysis pipeline for PAR-CLIP data

An issue about the sam file #6

Open mississippi2020 opened 1 year ago

mississippi2020 commented 1 year ago

When I run PARalyzer itself (not PARpipe), I run into some problems. Here is the specific information: 1673246384(1) Does anyone know how to solve it?

MSajek commented 1 year ago

I had exactly the same issue. It looks like it can be caused by very long reads. Filtering the fastq files in the preprocessing step can help here (e.g. filter out long reads during the adapter trimming step). If that does not help, you can try to identify which line in the input sam file causes the error. I analyzed small files and was able to do it manually, but for larger input files a script may be required.

First, check the last line in your_file_PARalyzer_Utilized.sam:

tail -n 1 your_file_PARalyzer_Utilized.sam

That is the last line utilized by PARalyzer that did not produce an error. Then find the number of this line in your original input.sam. Let's assume the read name in this line is read_10000:

grep -n 'read_10000' input.sam

Let's assume the line number is 10000; then produce a tmp.sam by increasing this number by e.g. 1000:

head -n 11000 input.sam > tmp.sam

Run PARalyzer with tmp.sam and check whether it produces the error. If not, increase further (e.g. by another 1000). If yes, you know the line causing the error is between 10001 and 11000, and you can start the next iteration, e.g. using the halfway value (10500) to see whether the bad line is between 10001 and 10500 or between 10501 and 11000. In this way, with a relatively low number of iterations, you should be able to identify the line causing the error and delete it, e.g.:

grep -v 'read_name_in_wrong_line' input.sam > input_corrected.sam
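The bisection described above can also be automated with a short shell script. This is only a sketch: the run_paralyzer_ok predicate below uses a grep stand-in on a toy ten-line file (where line 7 is the "bad" read), and in real use you would replace it with an actual PARalyzer run on tmp.sam; all file and read names here are illustrative.

```shell
#!/bin/sh
set -e

# Toy "SAM" file for demonstration: line 7 (read_999) is the offending read.
printf 'read_%d\n' 1 2 3 4 5 6 999 8 9 10 > input.sam

# Stand-in predicate: returns 0 if the first $1 lines would parse cleanly.
# In real use, replace the grep with a PARalyzer run on tmp.sam.
run_paralyzer_ok() {
    head -n "$1" input.sam > tmp.sam
    ! grep -q 'read_999' tmp.sam
}

lo=0                        # largest prefix known to be good
hi=$(wc -l < input.sam)     # whole file is known to fail
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if run_paralyzer_ok "$mid"; then lo=$mid; else hi=$mid; fi
done

echo "offending line: $hi"      # prints: offending line: 7
sed -n "${hi}p" input.sam       # show the bad read
```

Each iteration halves the search interval, so even a multi-million-line sam file needs only a couple of dozen PARalyzer runs.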

santataRU commented 4 months ago

Hi MSajek, thanks for sharing. What do you mean by very long reads? Are reads with lengths greater than 100 nt considered long reads? Xiao

MSajek commented 4 months ago

Hi Xiao, after I got this error and figured out why it occurs, I now always set the maximum length threshold to 50 (at the adapter trimming stage I add the -M 50 parameter to cutadapt). With that setup I have never seen this error again. Marcin
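For anyone who wants to see what such a length cap would remove without re-running the trimmer, here is a minimal sketch of the same maximum-length filter in plain awk on a toy fastq (in the real pipeline this is simply cutadapt with -M 50 added to the trimming command; the toy file and read names below are made up):

```shell
#!/bin/sh
# Toy fastq: one 4-nt read (r1) and one 60-nt read (r2).
printf '@r1\nACGT\n+\nIIII\n@r2\n%s\n+\n%s\n' \
    "$(printf 'A%.0s' $(seq 60))" "$(printf 'I%.0s' $(seq 60))" > toy.fastq

# Keep only records whose sequence line is <= max nt (mirrors cutadapt -M).
awk 'BEGIN{max=50} {
    rec[NR%4]=$0
    if (NR%4==0 && length(rec[2])<=max)   # rec[2] holds the sequence line
        print rec[1] ORS rec[2] ORS rec[3] ORS rec[0]
}' toy.fastq > filtered.fastq
```

Only r1 survives the filter; with max=50 this drops exactly the reads that -M 50 would discard (though cutadapt applies the check after adapter removal, which is where you want it).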

santataRU commented 4 months ago

Hi MSajek,

Thank you for your input. I am encountering a similar error message (attached below) that states "string index out of range." Interestingly, this issue arises only with one of the three similar datasets I am working with. The minimum and maximum read lengths for these datasets are 20 nt and 140 nt, respectively, which makes this situation quite puzzling.

By the way, I have not received any replies or assistance when I reach out to the PARalyzer developers. It appears that members of the Ohler lab (including Dr. Ohler himself) do not respond to inquiries about the tools they develop, whether on GitHub or the Biostar forum. Your comments here have been helpful and appreciated.

Best regards,

Xiao

The error message in my case:

Running PARalyzer v1.5
Parsing SAM file(s)...Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -123
	at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:47)
	at java.base/java.lang.String.charAt(String.java:693)
	at PARCLIPsamParser.parseFile(PARCLIPsamParser.java:84)
	at PARalyze.main(PARalyze.java:118)

MSajek commented 4 months ago

Hi Xiao, it is definitely worth trying to identify the read causing the error. In my dataset I shouldn't have had reads longer than 35-40 nt, yet the one causing the error was > 100 nt. So sometimes weird things happen. It is not always easy to get feedback from developers, especially when they are no longer working in academia. And I can imagine Uwe is really busy; you probably have to email him directly to get any feedback from him. Marcin

santataRU commented 4 months ago

Hi Marcin,

Thank you for your response. Given that all of my datasets contain reads longer than 100 nt, it suggests that only some of the longer reads may be causing the problem (otherwise, I would encounter this error across all datasets). While it is worth attempting to identify the reads causing errors, I lack the computational biology background to modify the PARalyzer tool and fix the bug.

I have emailed Uwe twice before, but he never responded. As you mentioned, some of the trainees in his lab are no longer working in academia, leaving no one to address these bugs. I hope there are more alternatives to PARalyzer available (I know WavClusteR is one of them).

Thank you again for your help.

Best regards,

Xiao

MSajek commented 4 months ago

Hi Xiao, based on your comment I assume you have paired-end sequencing, which of course is standard today. For PAR-CLIP analysis you should use only the read that contains the mRNA sequence (depending on the library preparation protocol it may be read 1 or read 2). The read with the cDNA sequence should be discarded. Try to re-run your pipeline with only one read and let me know if you still get the error. Also, can you provide more details about your analysis? Unfortunately I don't code in Java, so I can't fix the bug in PARalyzer. As an alternative I can suggest omniCLIP, also from the Ohler lab. Installation is not easy, but once you have set everything up properly, it should work. It requires much more computational resources and runs much longer (~16-48 h per analysis), but it also produces good results. I can guide you through the installation process if you want. Best, Marcin

santataRU commented 4 months ago

Hi Marcin,

Thank you very much! I only perform single-end sequencing (not paired-end) for PAR-CLIP experiments. After pre-processing my fastq datasets (using Cutadapt to remove adapters, fastx_barcode_splitter to split barcodes, fastx_collapser to collapse reads with identical UMIs, and fastx_trimmer for 5' trimming of adapters), I map the post-processed reads (in fasta format) to hg38 with Bowtie. I then convert the sorted and indexed bam files to sam files containing only the mapped reads, and the sam file is used as input for PARalyzer.

A detailed example is below (raw data: fastq from 1x150bp single-end sequencing):

1) adapter removal: cutadapt -m 28 -a AGATCGGA -o CLIP70_cutadapt.fastq CLIP70.fastq.gz --max-n 0 --discard-untrimmed -j 0 | tee CLIP70_cutadapt.txt

2) barcode splitting: cat CLIP70_cutadapt.fastq | /PATH/fastx_barcode_splitter.pl --bcfile /PATH/barcodesXL.txt --bol --mismatches 0 --prefix CLIP70 2>&1 | tee CLIP70_cutadapt_bcodes.txt

3) collapsing reads with the same UMI: fastx_collapser -v -i CLIP70BC1 -o CLIP70BC1_col 2>&1 | tee CLIP70BC1_col.txt

4) trimming the 5' adapter: fastx_trimmer -f 9 -v -i CLIP70BC1_col -o CLIP70BC1_col_trim

5) bowtie mapping reads to hg38:

bowtie -f -v 2 -m 1 --best --strata -p 8 -S /PATH/hg38/bowtieindex/hg38_genome CLIP70BC1_col_trim 2> CLIP70BC1_m1_bowtie.txt |\
samtools view -bhS - | \
samtools sort - \
 > CLIP70BC1_HA-RBP_m1.sorted.bam; samtools index CLIP70BC1_HA-RBP_m1.sorted.bam 

6) converting the sorted and indexed bam file to a sam file containing only the mapped reads (this sam file is the input for PARalyzer): samtools view -h -F 4 CLIP70BC1_HA-RBP_m1.sorted.bam > CLIP70BC1_HA-RBP_m1.mapped.sam

7) run PARalyzer v1.5.0

step 1, setting up a .ini file with the filename "CLIP70BC1_HA-RBP_hg38_m1.ini".

BANDWIDTH=3
CONVERSION=T>C
MINIMUM_READ_COUNT_PER_GROUP=5
MINIMUM_READ_COUNT_PER_CLUSTER=5
MINIMUM_READ_COUNT_FOR_KDE=5
MINIMUM_CLUSTER_SIZE=11
MINIMUM_CONVERSION_LOCATIONS_FOR_CLUSTER=1
MINIMUM_CONVERSION_COUNT_FOR_CLUSTER=1
MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION=5
MINIMUM_READ_LENGTH=1
MAXIMUM_NUMBER_OF_NON_CONVERSION_MISMATCHES=2

EXTEND_BY_READ

GENOME_2BIT_FILE=/PATH/Index/hg38/hg38.2bit
SAM_FILE=/PATH/CLIP70BC1_HA-RBP_m1.mapped.sam
OUTPUT_DISTRIBUTIONS_FILE=/PATH/CLIP70BC1_HA-RBP_m1_hg38_Paralyzer_distri.csv
OUTPUT_GROUPS_FILE=/PATH/CLIP70BC1_HA-RBP_m1_hg38_Paralyzer_group.csv
OUTPUT_CLUSTERS_FILE=/PATH/CLIP70BC1_HA-RBP_m1_hg38_Paralyzer_clusters.csv

step 2, run PARalyzer: ./PARalyzer 32G CLIP70BC1_HA-RBP_hg38_m1.ini
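Before a run like step 2 above, it can also be worth tabulating the read lengths in the mapped sam file, since outlier lengths turned out to be the suspect in this thread. A minimal sketch on a toy two-read sam file (with the real pipeline the input would be the mapped.sam from step 6):

```shell
#!/bin/sh
# Toy SAM body: two mapped reads, both 4 nt long (column 10 is the sequence).
printf 'r1\t0\tchr1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n'  > mapped.sam
printf 'r2\t0\tchr1\t9\t255\t4M\t*\t0\t0\tTTTT\tIIII\n' >> mapped.sam

# Count how many alignments have each sequence length, skipping header lines.
awk 'BEGIN{FS="\t"} !/^@/ {n[length($10)]++}
     END{for (l in n) print l, n[l]}' mapped.sam
```

On a real dataset this prints one line per observed length; a handful of reads at an unexpected length (like the 140 nt reads mentioned above) stands out immediately.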

I have attached a link at the end to access my data, which includes fasta files, sorted.bam files, mapped.sam files, and .ini files for the three datasets used for PARalyzer. If you want to replicate the error from your end, please note that only one dataset is giving me an error.

I have some questions regarding omniCLIP:

1) Are you running omniCLIP on Mac systems with M series chips or Intel chips?

2) PAR-CLIP experiments do not include background controls (and thus lack background RNA-seq files). Can I use omniCLIP to process PAR-CLIP experimental datasets?

3) Did you use Conda to install omniCLIP? It appears to me that there are no Conda options to install the tool.

Thank you,

Xiao

Link: https://drive.google.com/drive/folders/1131joG6NS264iVMYvnWTz3iU2QmnEH7B?usp=sharing

MSajek commented 4 months ago

For omniCLIP some background control is required; that's an important limitation. Yes, I installed omniCLIP via conda. I will take a look at your files. I think we can switch to email communication until we find the reason for the error, so as not to spam GitHub, and post only the final solution here. Can you send your email address to marcin.sajek@gmail.com?

MSajek commented 4 months ago

To summarize, in case someone runs into a similar issue: again, it was one weird read. From the error message:

Parsing SAM file(s)...Exception in thread "main" java.lang.StringIndexOutOfBoundsException: Index -123 out of bounds for length 138
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:55)
	at java.base/jdk.internal.util.Preconditions$1.apply(Preconditions.java:52)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
	at java.base/jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
	at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
	at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
	at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
	at java.base/java.lang.String.checkIndex(String.java:4575)
	at java.base/java.lang.StringLatin1.charAt(StringLatin1.java:46)
	at java.base/java.lang.String.charAt(String.java:1535)
	at PARCLIPsamParser.parseFile(PARCLIPsamParser.java:84)
	at PARalyze.main(PARalyze.java:118)

one can see that the read causing the error has length 138, so reads of that length can be tested one by one, or all removed from the *.sam file. After removing one of the 3 reads of that length in this particular case, everything started to work properly. Marcin
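A sketch of that removal step: drop every alignment whose sequence (SAM column 10) matches the offending length, keeping header lines intact. The toy input below uses length 5 so the example is self-contained; for a real case like the one above you would set bad=138 and run it on your actual sam file:

```shell
#!/bin/sh
# Toy SAM: one header line, a 4-nt read (r1) and a 5-nt read (r2).
printf '@HD\tVN:1.6\n'                                    > input.sam
printf 'r1\t0\tchr1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n'  >> input.sam
printf 'r2\t0\tchr1\t9\t255\t5M\t*\t0\t0\tACGTA\tIIIII\n' >> input.sam

# Keep headers (lines starting with '@') and all reads NOT of the bad length.
awk -v bad=5 'BEGIN{FS="\t"}
    /^@/ || length($10) != bad' input.sam > input_corrected.sam
```

Here r2 (length 5) is dropped while the header and r1 survive. If only some reads of the bad length are problematic, the grep -v approach on individual read names mentioned earlier in the thread is the more surgical option.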

santataRU commented 4 months ago

Thank you very much, Marcin.

I examined the problematic read (sequence attached at the end), and as you mentioned, there is no obvious feature indicating that this read is problematic. It can be mapped unambiguously to an exon of the ALPK3 gene on chromosome 15. Since this problem is reproducible on your end, and I was able to replicate your solution on my end, it is strange that PARalyzer fails with this read.

Best regards,

Xiao

>1767224-1
CAGACAGGAGTGCACAGAAGGGCATGATGACACAGGGAAGGGCAGAGACACAGCTAGAAACAACACAGGCAGGCGAGAAGATACAGGAAGACAGGAAGGCCCAGGCAGATAAGGGCACACAGGAAGACAGAAGTATGC