t-neumann / slamdunk

Streamlining SLAM-seq analysis with ultra-high sensitivity
GNU Affero General Public License v3.0

about 3utr numbers #152

Closed bioinformatica closed 5 months ago

bioinformatica commented 5 months ago

The ucsc-v45 3' UTR file I downloaded from the UCSC Table Browser has 140,000 lines, but your final file has only about 30,000 lines. Could you tell me how you reduced it to roughly 30,000 entries? Below is an excerpt of my 3' UTR file:

chr1 67092164 67093004 ENST00000684719.1_utr3_7_0_chr1_67092165_r 0 -
chr1 67092164 67093004 ENST00000371007.6_utr3_7_0_chr1_67092165_r 0 -
chr1 67092175 67093004 ENST00000371006.5_utr3_5_0_chr1_67092176_r 0 -
chr1 67092175 67093579 ENST00000475209.6_utr3_6_0_chr1_67092176_r 0 -
chr1 67092396 67096311 ENST00000621590.4_utr3_2_0_chr1_67092397_r 0 -
chr1 201328836 201328868 ENST00000263946.7_utr3_13_0_chr1_201328837_f 0 +
chr1 201330073 201332993 ENST00000263946.7_utr3_14_0_chr1_201330074_f 0 +
chr1 201328836 201328868 ENST00000367324.8_utr3_12_0_chr1_201328837_f 0 +
chr1 201330073 201332989 ENST00000367324.8_utr3_13_0_chr1_201330074_f 0 +
chr1 8352396 8355086 ENST00000337907.7_utr3_23_0_chr1_8352397_r 0 -
chr1 8352403 8355086 ENST00000400908.7_utr3_22_0_chr1_8352404_r 0 -
chr1 8352405 8355086 ENST00000377464.5_utr3_16_0_chr1_8352406_r 0 -
chr1 8352409 8355086 ENST00000465125.2_utr3_12_0_chr1_8352410_r 0 -
chr1 8353942 8355086 ENST00000400907.6_utr3_14_0_chr1_8353943_r 0 -
chr1 8354092 8355086 ENST00000476556.5_utr3_12_0_chr1_8354093_r 0 -
chr1 33513997 33516570 ENST00000373381.9_utr3_70_0_chr1_33513998_r 0 -
chr1 33519464 33519517 ENST00000373381.9_utr3_69_0_chr1_33519465_r 0 -
chr1 33513998 33516570 ENST00000619121.4_utr3_70_0_chr1_33513999_r 0 -
chr1 33519464 33519517 ENST00000619121.4_utr3_69_0_chr1_33519465_r 0 -
chr1 33514008 33516570 ENST00000373388.7_utr3_69_0_chr1_33514009_r 0 -

best

t-neumann commented 5 months ago

What I did was merge all UTRs for a given gene with bedtools merge. In many cases that still leaves more than one counting window per gene, which is OK.
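The per-gene merge described above can be sketched in plain Python. This is only an illustration of the idea, not the exact procedure used for the slamdunk annotation: it assumes a transcript-to-gene lookup (`tx2gene`, a hypothetical input, since the BED name field only carries transcript IDs), uses the placeholder gene label `GENE_X`, and treats touching intervals as mergeable, which matches the default behavior of bedtools merge. The four rows are taken from the file excerpt in the question.

```python
from collections import defaultdict


def merge_utrs_per_gene(records, tx2gene):
    """Merge overlapping or book-ended 3' UTR intervals per gene.

    records: iterable of BED6 tuples (chrom, start, end, name, score, strand).
    tx2gene: dict mapping transcript ID -> gene label (hypothetical input).
    """
    grouped = defaultdict(list)
    for chrom, start, end, name, _score, strand in records:
        # The transcript ID is the part of the name before "_utr3_".
        transcript = name.split("_utr3_")[0]
        gene = tx2gene[transcript]
        grouped[(chrom, strand, gene)].append((int(start), int(end)))

    merged = []
    for (chrom, strand, gene), intervals in sorted(grouped.items()):
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current window.
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current window and open a new one.
                merged.append((chrom, cur_start, cur_end, gene, 0, strand))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end, gene, 0, strand))
    return merged


records = [
    ("chr1", 201328836, 201328868, "ENST00000263946.7_utr3_13_0_chr1_201328837_f", 0, "+"),
    ("chr1", 201330073, 201332993, "ENST00000263946.7_utr3_14_0_chr1_201330074_f", 0, "+"),
    ("chr1", 201328836, 201328868, "ENST00000367324.8_utr3_12_0_chr1_201328837_f", 0, "+"),
    ("chr1", 201330073, 201332989, "ENST00000367324.8_utr3_13_0_chr1_201330074_f", 0, "+"),
]
# Placeholder mapping: "GENE_X" is an illustrative label, not a real annotation.
tx2gene = {"ENST00000263946.7": "GENE_X", "ENST00000367324.8": "GENE_X"}

windows = merge_utrs_per_gene(records, tx2gene)
for row in windows:
    print(*row, sep="\t")
```

Here four transcript-level rows collapse into two windows for the same gene label, because the two UTR exons are separated by a gap.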

bioinformatica commented 5 months ago

Dear Tobias:

If the 3' UTR coordinates of a gene's transcripts do not all overlap, will the merged gene still have multiple lines? For example:

Before merging:
chr1 1 10 gene1 transcript 3utr
chr1 5 15 gene1 transcript 3utr
chr1 18 20 gene1 transcript 3utr

After merging:
chr1 1 15 gene1
chr1 18 20 gene1

best

t-neumann commented 5 months ago

Yes, then you will have multiple entries per gene, correct.
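To see how many genes end up with more than one counting window after merging, one could tally the name column of the merged file. The snippet below is a minimal sketch that uses the toy merged output from the question as inline text; the four-column tab-separated layout (chrom, start, end, gene) is an assumption about how the merged file is written.

```python
from collections import Counter

# Toy merged output from the example above (tab-separated: chrom, start, end, gene).
merged_bed = "chr1\t1\t15\tgene1\nchr1\t18\t20\tgene1\n"

# Count how many merged windows each gene has.
windows_per_gene = Counter(
    line.split("\t")[3] for line in merged_bed.splitlines() if line
)

# Keep only genes that still have more than one window.
multi_window_genes = {g: n for g, n in windows_per_gene.items() if n > 1}
print(multi_window_genes)
```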

bioinformatica commented 5 months ago

Thank you~