mortazavilab / lapa

Alternative polyadenylation detection from diverse data sources such as 3'-seq, long-read and short-reads.
https://www.biorxiv.org/content/10.1101/2022.11.08.515683v1
23 stars 13 forks source link

Support for non stranded paired ended data #13

Closed margaretc-ho closed 1 year ago

margaretc-ho commented 1 year ago

Hi there, I encountered this error when running LAPA on single short read BAM file. What do you advise to solve this? Thanks!

lapa --alignment ${illumina_bam_dir}/${bamfile} --fasta ${reference_genome_fa} --annotation ${reference_gtf} --chrom_sizes ${chrom_sizes} --output_dir ${outdir}/vb_annot/${samplename}_illumina        

Error:

Traceback (most recent call last): File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/bin/lapa", line 8, in sys.exit(cli_lapa()) File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/click/core.py", line 1130, in call return self.main(args, kwargs) File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/click/core.py", line 1055, in main rv = self.invoke(ctx) File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/click/core.py", line 1404, in invoke return ctx.invoke(self.callback, ctx.params) File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/click/core.py", line 760, in invoke return __callback(args, **kwargs) File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/lapa/main.py", line 112, in cli_lapa lapa(alignment, fasta, annotation, chrom_sizes, output_dir, File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/lapa/lapa.py", line 497, in lapa _lapa(alignment) File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/lapa/lapa.py", line 288, in call df_all_count, sample_counts = self.counting(alignment) File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/lapa/lapa.py", line 142, in counting df_all_count, sample_counts = counter.to_df() File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/lapa/count.py", line 583, in to_df df = pd.concat([ File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/lapa/count.py", line 584, in self.build_counter(row['path']) File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/lapa/count.py", line 142, in to_df return self.to_gr().df.astype({'Chromosome': 'str', 'Strand': 'str'}) File "/hpcdata/bcbb/homc/conda_envs/envs/lapa_env/lib/python3.9/site-packages/pandas/core/generic.py", line 6212, in astype raise KeyError( KeyError: "Only a column name can be used for the key in a dtype mappings argument. 'Chromosome' not found in columns."

My chrom.sizes file for Anopheles gambiae looks like this, in case that helps (it was generated using samtools faidx as instructed)

AgamP4_2L 49364325 AgamP4_2R 61545105 AgamP4_3L 41963435 AgamP4_3R 53200684 AgamP4_UNKN 42389979 AgamP4_X 24393108 AgamP4_Y_unplaced 237045 AAAB01000047 21505 AAAB01000163 28420 AAAB01000448 22809 AAAB01000791 62303 (..more contigs..) AgamP4_Mt 15363

MuhammedHasan commented 1 year ago

Hi @margaretc-ho,

It looks like there could be an issue in your bam file. If it is possible, can you share your bam file, or what is the output of this command samtools view -q 10 ${illumina_bam_dir}/${bamfile} | head?

Also, log files will help us to spot the issue. Can you share the content of this file ${outdir}/vb_annot/${samplename}_illumina/logs/progress.log?

margaretc-ho commented 1 year ago

Hi @MuhammedHasan thanks for taking a look! Here is one of the bam files throwing error

(lapa_env) [homc@ai-hpcn037 summary_bam]$ samtools view -q 10 NP_Prime.bam | head A00742:390:HK5VJDSX3:3:2150:23384:4163 163 AgamP4_2L 3042 60 101M = 3042 101 CTTATGTTGTTGACGTCCCCGGAACATAGCTACGGTGTGGAAACGTTGTGTAGGCAGGTTTTGCAGGTGCGATCGTCAAATCAACTCTAGCTTGTGCTTTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 ZS:i:-8 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP NH:i:1 A00742:390:HK5VJDSX3:3:2150:23384:4163 83 AgamP4_2L 3042 60 101M = 3042 -101 CTTATGTTGTTGACGTCCCCGGAACATAGCTACGGTGTGGAAACGTTGTGTAGGCAGGTTTTGCAGGTGCGATCGTCAAATCAACTCTAGCTTGTGCTTTT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:0 ZS:i:-5 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:101 YS:i:0 YT:Z:CP NH:i:1 A00738:392:HK7JCDSX3:4:1601:30101:27164 163 AgamP4_2L 4144 60 8S142M = 4146 159 GTTATCTAATTTTTAGGTCATTTTTATGAACTTTGCGAGATTCTTCGGTTGTAATTGTGCTCCTATTAACATTTTTAACTTTAATCTTTTTGTATCTTGGTTTGTGTTTTAATCTAGCCCTAGTCTTTTCATAAATTGTTTTAGCAATTG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-18 ZS:i:-14 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:31T96C13 YS:i:-10 YT:Z:CP NH:i:1 A00738:392:HK7JCDSX3:4:2567:23773:2268 99 AgamP4_2L 4144 60 8S142M = 4146 159 GTTATCTAATTTTTAGGTCATTTTTATGAACTTTGCGAGATTCTTCGGTTGTAATTGTGCTCCTATTAACATTTTTAACTTTAATCTTTTTGTATCTTGGTTTGTGTTTTAATCTAGCCCTAGTCTTTTCATAAATTGTTTTAGCAATTG FFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF:F:FFFFFFFF,FFFFF:FFF:FFFFFFFFFFFF:FFFFFF:FFFFFFFFFFFFFF,FFFFFF:FFFFF:FFFFF:FFFF:FFFFFFF,FFFFFFFFFFFFFFFFFF:FFFFFFFF:F AS:i:-18 ZS:i:-14 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:31T96C13 YS:i:-10 YT:Z:CP NH:i:1 A00738:392:HK7JCDSX3:4:1601:30101:27164 83 AgamP4_2L 4146 60 149M = 4144 -159 TTTTAGGTCATTTTTATGAACTTTGCGAGATTCTTCGGTTGTAATTGTGCTCCTATTAACATTTTTAACTTTAATCTTTTTGTATCTTGGTTTGTGTTTTAATCTAGCCCTAGTCTTTTCATAAATTGTTTTAGCAATTGGTATTGTTA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-10 ZS:i:-15 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:29T96C22 YS:i:-18 YT:Z:CP NH:i:1 A00738:392:HK7JCDSX3:4:2567:23773:2268 147 AgamP4_2L 4146 60 149M = 4144 -159 TTTTAGGTCATTTTTATGAACTTTGCGAGATTCTTCGGTTGTAATTGTGCTCCTATTAACATTTTTAACTTTAATCTTTTTGTATCTTGGTTTGTGTTTTAATCTAGCCCTAGTCTTTTCATAAATTGTTTTAGCAATTGGTATTGTTA F,FFFFFFFFFFFFFFFFFF:FFFF:FFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-10 ZS:i:-15 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:29T96C22 YS:i:-18 YT:Z:CP NH:i:1 A00738:392:HK7JCDSX3:4:1612:21160:17002 177 AgamP4_2L 4149 60 149M AgamP4_3L 1717891 0 TAGGTCATTTTTATGAACTTTGCGAGATTCTTCGGTTGTAATTGTGCTCCTATTAACATTTTTAACTTTAATCTTTTTGTATCTTGGTTTGTGTTTTAATCTAGCCCTAGTCTTTTCATAAATTGTTTTAGCAATTGGTATTGTTATCT FF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-10 ZS:i:-15 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:26T96C25 YT:Z:UP NH:i:1 A00738:392:HK7JCDSX3:4:2351:6045:3474 145 AgamP4_2L 4149 60 149M AgamP4_X 20181859 0 TAGGTCATTTTTATGAACTTTGCGAGATTCTTCGGTTGTAATTGTGCTCCTATTAACATTTTTAACTTTAATCTTTTTGTATCTTGGTTTGTGTTTTAATCTAGCCCTAGTCTTTTCATAAATTGTTTTAGCAATTGGTATTGTTATCT FFF::FF:FFFFFFFFFFFFFFFFFFFFFF:FF:FFFF:FFFFFF:FFF:FFFFFFFFFFFF:FFFFFFFFFF:FFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-10 ZS:i:-15 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:26T96C25 YT:Z:UP NH:i:1 A00738:392:HK7JCDSX3:4:2439:25690:9972 177 AgamP4_2L 4149 60 149M AgamP4_3L 1717891 0 TAGGTCATTTTTATGAACTTTGCGAGATTCTTCGGTTGTAATTGTGCTCCTATTAACATTTTTAACTTTAATCTTTTTGTATCTTGGTTTGTGTTTTAATCTAGCCCTAGTCTTTTCATAAATTGTTTTAGCGATTGGTATTGTTATCT FFFFFFFFFFFFFFFFFFF,FFFFFF:FF,FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF AS:i:-14 ZS:i:-20 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:26T96C8A16 YT:Z:UP NH:i:1 A00738:392:HK7JCDSX3:4:1443:18765:9095 99 AgamP4_2L 4163 60 150M = 4251 238 GAACTTTGCGAGATTCTTCGGTTGTAATTGTGCTCCTATTAACATTTTTAACTTTAATCTTTTTGTATCTTGGTTTGTGTTTTAATCTAGCCCTAGTCTTTTCATAAATTGTTTTAGCAATTGGTATTGTTATCTCCCTTTTGTTTTTAT FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFF,FFFFFF:FF,FFFF,FFFFFFF:FF:FFFFFFFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFF:FFFF,FFFFF:FFFFFFFFFFFFFFFFFFFFFFF AS:i:-10 ZS:i:-15 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:12T96C40 YS:i:-20 YT:Z:CP NH:i:1

This is the content of progress.log for the run

===== Counting reads (1 / 5) =====

Counting /hpcdata/bcbb/homc/Barillaslab/Agam_hemo_transcriptomes/alignmentfiles/illumina_hisat2_bam_files/summary_bam/NP_Prime.bam:

warnings.log and final_stats.log are empty

MuhammedHasan commented 1 year ago

@margaretc-ho, Your dataset is paired ended, and this is causing the issue. I haven't seen paired 3'-seq protocol before.

The directionality of reads is important to detect poly(A)-sites, and it could be tricky to get the correct direction from paired-ended protocol. Can you tell me the name of the protocol and provide some details? Is there a publicly available similar to your dataset?

I am happy to support paired-ended 3'-seq protocols in LAPA if you provide more details.

margaretc-ho commented 1 year ago

Hi @MuhammedHasan,

Some of these samples are unstranded, and other are directional second strand, could they still be used for polyA site detection? Most of the libraries are polyA selected using oligo-dT primers (a handful of libraries were prepared with ribominus). They are not specifically 3' seq. These are all Illumina sequenced (but we also have matching RNA-seq libraries for some samples with PacBio long read sequencing).

PolyA detection was not an explicit goal of the sequencing, but I am trying to use some information about polyA sites to determine transcript completeness. I hope that makes sense.

Thank you for trying to help me! It would be great to be able to do polyA detection on these kinds of paired end read samples.

MuhammedHasan commented 1 year ago

Hi @margaretc-ho,

Can you share IGV screenshots of the bam file for example gene one showing the entire gene body and another for just 3' UTR?