mozack / ubu

UNC-Chapel Hill Bioinformatics Utilities
23 stars 13 forks source link

sam-xlate bed format /hg38 / gencode #4

Open darked89 opened 5 years ago

darked89 commented 5 years ago

Hello,

i have tried:

java -jar ./ubu-1.2b-SNAPSHOT-jar-with-dependencies.jar sam-xlate \
--bed gencode.v31.annotation.no_head.bed \
--in test.hg38genome.bam \
--out test.hg38genome.bam

I am getting following error:

<snip>
Skipping isoform header order determination
Building read index
Exception in thread "main" java.lang.NumberFormatException: For input string: "transcript_id"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Integer.parseInt(Integer.java:580)
    at java.lang.Integer.parseInt(Integer.java:615)
    at edu.unc.bioinf.ubu.sam.IsoformIndex.getExonCoordinates(IsoformIndex.java:49)
    at edu.unc.bioinf.ubu.sam.IsoformIndex.buildReadToIsoformIndex(IsoformIndex.java:93)
    at edu.unc.bioinf.ubu.sam.GenomeToTranscriptome.run(GenomeToTranscriptome.java:70)
    at edu.unc.bioinf.ubu.Ubu.run(Ubu.java:42)
    at edu.unc.bioinf.ubu.Ubu.main(Ubu.java:91)

My java version:

java version "1.8.0_212"

It seems that converting the gencode GTF with awk/bedops:

awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' gencode.v31.annotation.no_head.gtf  | gtf2bed - > \
gencode.v31.annotation.no_head.bed

does not produce BED format resembling one provided with ubu:

ubu/src/test/java/edu/unc/bioinf/ubu/sam/testdata/unc_hg19.bed

Any ideas how to convert gencode.v31 for hg38 into ubu-compatible BED?

Many thanks for your help in advance.

Darek Kedra

VitorAguiar commented 3 years ago

Hi Darek, I also would like to know how to generate a sam-xlate compatible BED file from a GTF. Please, did you figure it out?

Vitor

darked89 commented 3 years ago

Hi Vitor,

I need to go through the old notes/emails to check the steps. I will keep you posted.

Darek

darked89 commented 3 years ago

Hello Vitor,

I have not found the working solution. But maybe we can figure it out.

The ubu bed format is BED12 one line per transcript:

chr1    11873   14409   uc001aaa.3  0   +   0   0   0   3   354,109,1189,   0,739,1347,
chr1    11873   14409   uc010nxq.1  0   +   0   0   0   3   354,127,1007,   0,721,1529,

gtf2bed from bedops outputs one feature per line. And includes gene, transcript and exon. So the output must have transcript_id in the 4th column and the transcript exons data in:

10. blockCount - the number of sub-elements (e.g. exons) within the feature
11. blockSizes - the size of these sub-elements
12 blockStarts - the start coordinate of each sub-element

Above is from the https://m.ensembl.org/info/website/upload/bed.html

You may check this gist: https://gist.github.com/gireeshkbogu/f478ad8495dca56545746cd391615b93

but I have not tested it.

DK

VitorAguiar commented 3 years ago

Hi Darek,

thank you for your insight.

I have had no luck with the GTF -> BED12 conversion tools. Instead, I've created my own script, since the specifications that you described don't look very challenging. As far as I understood, I just need to take the GTF, (1) select the exon entries, (2) convert from 1-based to 0-based coordinates, (3) compute block sizes (exon lengths), and (4) compute block starts (start positions of each exon - start position of first exon), and append commas in the end for columns 11 and 12. Other columns (such as score, thickStart, thickEnd, itemRgb) seem to be all set to zero in the example BED, and are apparently not relevant for sam-xlate.

Then, I tried to run a test with a single human sample, using Gencode v36.

From the example in the wiki, it seems like we need a BAM file sorted by reference (chromosomes) and then by read names. This is difficult, since standard tools usually sort either by reference and position, or by read name. I asked this question on Biostars, and a tool was recommended. Another option would be to split the BAM by chromosome with bamtools, sort each resulting BAM by read name, and finally merge all BAMs. I tried using sam-xlate with a BAM only sorted by read names, and the job could not finish after 24 hours. When I sort by reference, and then by read names, it goes much faster.

I wasn't able to provide a transcripts fasta for the --order option. The program complains about an invalid entry right at the first line. According to the source code, this is an error thrown in case the ID line in the fasta file is shorter than 2 chars, so it doesn't make sense to me. However, since this is optional, I proceeded.

I was able to run sam-xlate to convert a BAM generated with STAR. Then I estimated expression levels with Salmon for the sam-xlate BAM, and also for a BAM generated directly by STAR in transcriptome coordinates. When I compare the transcript-level estimates from the different methods, I see substantial differences.

To begin with, the sam-xlate BAM is only 40% the size of the BAM generated by STAR in transcriptome coordinates.

For transcripts with TPM = 0 in sam-xlate, 33k had TPM > 0 with STAR direct conversion (and 12k TPM > 1). On the other hand, for transcripts with TPM = 0 with STAR direct conversion, 6k had TPM > 0 with sam-xlate (and 2k TPM > 1). See figure below (similar picture for gene-level estimates).

I will revise my conversion script, and further investigate the reasons for such differences.

estimates