griffithlab / pmbio.org

Website for the precision medicine workshop
http://pmbio.org
MIT License

bed file #14

Open xiucz opened 6 years ago

xiucz commented 6 years ago

Hi, in this part of the tutorial, the command is:

bedtools intersect -wa -wb -b /workspace/inputs/references/transcriptome/gene_annotation.bed -a WGS_Tumor_merged_sorted_mrkdup_bqsr.2.cns > WGS_Tumor_merged_sorted_mrkdup_bqsr.2.annotated.cns

I know that the BED file is 0-based, and that the .2.cns file is also 0-based (its start was reduced by 1). But it seems that we should add 1 to the start of every record in the resulting cns file, because the CNS format is 1-based?

Thanks for your reply.

zlskidmore commented 6 years ago

Hi @xiucz, thanks for this report!

According to the documentation here, cnvkit outputs copy number segments in a 1-based format: https://cnvkit.readthedocs.io/en/stable/fileformats.html

On the page you linked, we run this to convert the 1-based coordinates from cnvkit to 0-based, to match the BED file:

tail -n +2 WGS_Tumor_merged_sorted_mrkdup_bqsr.cns | awk '{print $1"\t"$2-1"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8}' > WGS_Tumor_merged_sorted_mrkdup_bqsr.2.cns

So at this point WGS_Tumor_merged_sorted_mrkdup_bqsr.cns remains 1-based, but WGS_Tumor_merged_sorted_mrkdup_bqsr.2.cns is now 0-based.
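To make the conversion concrete, here is a minimal sketch on a toy file. The file names and all values are made up for illustration, and it assumes (as stated above) that the segment starts are 1-based. It uses awk's `OFS` instead of spelling out every column, which is a stylistic swap for the command above, not a change in behavior.

```shell
# Toy .cns file with made-up values; starts are assumed to be 1-based.
workdir=$(mktemp -d)
printf 'chromosome\tstart\tend\tgene\tlog2\tdepth\tprobes\tweight\n' > "$workdir/toy.cns"
printf 'chr1\t1\t1000\tGENE_A\t0.12\t30.5\t10\t9.8\n' >> "$workdir/toy.cns"
printf 'chr1\t1001\t5000\tGENE_B\t-0.45\t28.1\t40\t38.2\n' >> "$workdir/toy.cns"

# Drop the header line, subtract 1 from the start (column 2),
# and keep tab-separated output.
tail -n +2 "$workdir/toy.cns" \
  | awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 - 1; print }' \
  > "$workdir/toy.2.cns"

cat "$workdir/toy.2.cns"
```

Only the start column changes (1 becomes 0, 1001 becomes 1000); the end column stays as it is, which is what turns a 1-based inclusive interval into a 0-based half-open one.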

I often refer to this Biostars post when doing these coordinate conversions: https://www.biostars.org/p/84686/

We then run bedtools intersect on the 0-based BED file and the 0-based segment file:

bedtools intersect -wa -wb -b /workspace/inputs/references/transcriptome/gene_annotation.bed -a WGS_Tumor_merged_sorted_mrkdup_bqsr.2.cns > WGS_Tumor_merged_sorted_mrkdup_bqsr.2.annotated.cns

So at this point bedtools intersect is working on two 0-based files, and I think everything should be fine.
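As a rough illustration of why both inputs should share a coordinate system, the sketch below checks one edge case by hand. The `overlaps` helper is hypothetical (it is not part of bedtools) and only mimics a half-open, BED-style overlap test; the coordinates are made up.

```shell
# Hypothetical helper: prints "yes" if the half-open intervals
# [a_start, a_end) and [b_start, b_end) share at least one base.
# Arguments: a_start a_end b_start b_end
overlaps() {
  if [ "$1" -lt "$4" ] && [ "$3" -lt "$2" ]; then echo yes; else echo no; fi
}

# Gene in BED coordinates covering only the base at 1-based position 101.
gene_start=100; gene_end=101

# Segment reported as 101-200 in 1-based inclusive coordinates (made-up values).
seg_start=101; seg_end=200

# Treat the 1-based start as if it were already 0-based (no conversion).
unconverted=$(overlaps "$seg_start" "$seg_end" "$gene_start" "$gene_end")

# Shift the start by 1 first, as the awk step does.
converted=$(overlaps "$((seg_start - 1))" "$seg_end" "$gene_start" "$gene_end")

echo "start left 1-based:       $unconverted"
echo "start shifted to 0-based: $converted"
```

Without the shift, the first base of the segment is silently dropped, so a feature sitting exactly on that base is missed.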

Let me know if you disagree or if I've misunderstood the issue you've presented.

xiucz commented 6 years ago

Hi,

"We then run bedtools intersect on the 0-based BED file and the 0-based segment file."

I agree with you on this step, and the resulting file 2.annotated.cns is still 0-based. So if we want to use that file for further analysis, would it be better to convert it back to 1-based?
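If a downstream tool does expect 1-based segment starts, the conversion back could look something like the sketch below. This is only an illustration: the file names and values are made up, and the 4-column BED layout appended by `-wb` is an assumption about gene_annotation.bed.

```shell
workdir=$(mktemp -d)

# Toy line shaped like `bedtools intersect -wa -wb` output: 8 segment columns
# followed by the BED columns (assumed here to be chrom, start, end, name).
printf 'chr1\t0\t1000\tGENE_A\t0.12\t30.5\t10\t9.8\tchr1\t200\t800\tGENE_A\n' \
  > "$workdir/toy.annotated.cns"

# Add 1 to the segment start (column 2) only.
# The BED start appended by -wb (column 10 here) is left 0-based.
awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 + 1; print }' \
  "$workdir/toy.annotated.cns" > "$workdir/toy.annotated.1based.cns"

cat "$workdir/toy.annotated.1based.cns"
```

Note that after this step the file mixes conventions: the segment columns are 1-based while the annotation columns are still BED-style, so it is worth recording that somewhere alongside the file.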

I have one more suggestion: rename ".2.annotated.cns" to ".annotated.bed". This would make the coordinate system of the file clearer to newcomers.

Thank you.