ncbi / sra-tools

SRA Tools

How to convert a single-molecule long-read sequencing SRA file into a PacBio BAM file? #770

Open wu116 opened 1 year ago

wu116 commented 1 year ago

Dear developers, hello! Some projects use PacBio single-molecule long-read sequencing to analyze full-length transcriptomes, but the raw data is in BAM format and has to be converted into an sra file for upload to the SRA database. I want to analyze the data with the official software isoseq3, which needs a special PacBio BAM file, but sam-dump cannot correctly convert the sra file back into that PacBio BAM file. It seems that some information is lost when the uploader's PacBio BAM file is converted into an sra file. Could you please give some advice?

durbrow commented 1 year ago

I'm not familiar with PacBio's toolset, so I may be wrong...

From looking at their documentation for their SAM/BAM files, their tools expect CIGAR to use = and X instead of the usual M and will quit with an error when the M is encountered. If this is the cause, there should be a simple solution.

sam-dump has an option to use = and X, -c | --cigar-long.
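To check whether the M operation is actually present in a given sam-dump output, a minimal Python sketch like the following could be used. The helper names `cigar_ops` and `uses_ambiguous_match` are made up for illustration and are not part of sra-tools or any PacBio tool. Note that an unaligned record has `*` as its CIGAR, so this check (and the `--cigar-long` option) only matters for aligned records.

```python
import re

# One CIGAR element: a length followed by an operation letter (SAM spec).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_ops(cigar):
    """Return the set of operation letters used in a CIGAR string."""
    if cigar == "*":  # '*' means no CIGAR is available (e.g. unaligned read)
        return set()
    return {op for _, op in CIGAR_OP.findall(cigar)}

def uses_ambiguous_match(cigar):
    """True if the CIGAR uses 'M' instead of the explicit '=' and 'X'."""
    return "M" in cigar_ops(cigar)

for cigar in ("10M2I5M", "10=1X4=2I5=", "*"):
    print(cigar, uses_ambiguous_match(cigar))
```

If this reports `True` for aligned records, rerunning sam-dump with `-c | --cigar-long` should produce the `=`/`X` form instead.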

wu116 commented 1 year ago

Thanks for replying!

But that may not be the cause in my case. Here is the first line of the SAM file that sam-dump generates, with or without -c.

1       4       *       0       0       *       *       0       0       AGTTGTGGGAAGGAAGTTTTGATTGGTGAGGATGTGTTTGGTTTTGATTTTAATGATGTTATTAATTGATTTGTGAGTGTTTGATTAAGTAAGTTAAGTATAGTTGGTTGATGGAGTTGTTTGGGTTGAGATTTATAAAGAGTGAGTGGTGTAGCGATTGGGTAAAGAGGAGAAGATTTCGATTGTGTGGTTTTACAAGAGAACAATAACATGGAGTAGGATGTGCATATTAGTGCGAGTGGTAG   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!   RG:Z:1c1c6cfd

There may be some important information in the header of the PacBio BAM file, but the header is not retained even when I add -r | --header.

wu116 commented 1 year ago

I checked the sra file and found that the PacBio BAM header is there.

$BAM_HEADER@HD      VN:1.5  SO:unknown      pb:3.0.7
@RG     ID:1c1c6cfd     PL:PACBIO       DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;BINDINGKIT=101-789-500;SEQUENCINGKIT=101-826-100;BASECALLERVERSION=5.0.0;FRAMERATEHZ=100.000000;BarcodeFile=/share02/bioCloud/compute/cloudpub/u4359N4/dataProcess_RUN293_D07_20210111180029/split_xml_1610359240111/outputs/barcode.fa;BarcodeHash=2ad43f747b13dbca24d0688e8dff8ab2;BarcodeCount=11;BarcodeMode=Symmetric;BarcodeQuality=Score        LB:AF031-AD159-AC883 -ISO       PU:m64087_210109_061940 SM:ISO  PM:SEQUELII
@PG     ID:baz2bam      PN:baz2bam      VN:9.0.0.92233  CL:/opt/pacbio/ppa-9.0.0/bin/baz2bam /data/pa/m64087_210109_061940.baz -o /data/pa/m64087_210109_061940 --metadata /data/pa/.m64087_210109_061940.metadata.xml -j 32 -b 8 --inlinePbi --progress --silent --maxInputQueueMB 70000 --zmwBatchMB 50000 --zmwHeaderBatchMB 30000
@PG     ID:bazFormat    PN:bazformat    VN:1.6.0
@PG     ID:bazwriter    PN:bazwriter    VN:9.0.0
@PG     ID:lima VN:1.9.0 (commit 7727b1f) 

Is there any way for sam-dump to keep this header?
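For anyone inspecting such a header, the `DS` field of the `@RG` line is a semicolon-separated list of `key=value` pairs. The Python sketch below splits it into a dictionary; `parse_rg_description` is a hypothetical helper, and the sample line is an abridged copy of the `@RG` line above with tabs between fields.

```python
def parse_rg_description(rg_line):
    """Split the DS field of an @RG header line into a key/value dict."""
    fields = rg_line.rstrip("\n").split("\t")
    # Find the DS: field; fall back to an empty string if it is absent.
    ds = next((f[3:] for f in fields[1:] if f.startswith("DS:")), "")
    result = {}
    for item in ds.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            result[key] = value
    return result

# Abridged from the @RG line shown in this thread.
rg = ("@RG\tID:1c1c6cfd\tPL:PACBIO\t"
      "DS:READTYPE=SUBREAD;Ipd:CodecV1=ip;PulseWidth:CodecV1=pw;"
      "BINDINGKIT=101-789-500\tPU:m64087_210109_061940")

info = parse_rg_description(rg)
print(info["READTYPE"])
print(info["Ipd:CodecV1"], info["PulseWidth:CodecV1"])
```

In this header the `Ipd:CodecV1=ip` and `PulseWidth:CodecV1=pw` entries appear to name per-read tags (`ip`, `pw`), which suggests that the header refers to data stored with each read, not only to file-level metadata.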

durbrow commented 1 year ago

What is the accession you are working with?

wu116 commented 1 year ago

SRR16979014 in the project PRJNA774118.

wu116 commented 1 year ago

I found that the PacBio BAM file has additional columns after the standard columns in the body, so the header I showed may be irrelevant. Losing those columns may be the actual reason why isoseq3 gives an error.
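The missing columns can be confirmed with a short Python sketch that lists the optional tags of a SAM record (everything after the 11 mandatory fields) and compares them with the tags a PacBio tool might expect. The expected set here is an assumption: `ip` and `pw` come from the `@RG` `DS` field shown earlier in this thread, while `zm`, `np`, `rq` and `sn` are taken from my reading of the PacBio BAM format specification and may not match what isoseq3 actually requires. The helper name `optional_tags` is hypothetical, and the sequence in the sample record is shortened.

```python
# Assumed per-read tags (see the caveat above); adjust for your tool.
EXPECTED_PACBIO_TAGS = {"ip", "pw", "zm", "np", "rq", "sn"}

def optional_tags(sam_line):
    """Return the two-letter names of the optional TAG:TYPE:VALUE fields."""
    fields = sam_line.rstrip("\n").split("\t")
    return {
        f[:2]
        for f in fields[11:]  # the first 11 SAM columns are mandatory
        if len(f) >= 5 and f[2] == ":" and f[4] == ":"
    }

# Shortened version of the sam-dump record quoted in this thread.
record = "1\t4\t*\t0\t0\t*\t*\t0\t0\tAGTTGTGG\t!!!!!!!!\tRG:Z:1c1c6cfd"

present = optional_tags(record)
missing = sorted(EXPECTED_PACBIO_TAGS - present)
print("present:", sorted(present))
print("missing:", missing)
```

For the record above, only `RG` is present, so every tag in the assumed set is reported as missing.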

The best way to solve this problem may be to obtain the original BAM files that the uploaders submitted to SRA, through AWS or GCP.

But I still wonder whether it will be possible to convert the sra file into a PacBio BAM file in the future, after some update of sra-tools.

Thanks again for your kind help. :)

NatJWalker-Hale commented 1 year ago

Experiencing a similar problem - would love to work with some official pb tools which require pb bam headers, but these are not preserved when using sam-dump to write SAM (and converting to BAM with samtools view) from .sra files. Is the original header information kept in the .sra?

wu116 commented 1 year ago

I think the header I gave above might just be the header for the entire BAM file; the header for each read has not been kept in the sra file. So I gave up and tried to access the raw data as a BAM file.

MenglinC commented 2 months ago

I have the same problem. I downloaded the sra files from the database and converted them to SAM or BAM files using sam-dump, but the resulting files cannot be used for downstream analysis.