huslabgen commented 7 years ago

Hi, the below code fails in HG38 where some exons at alt contigs have start coord < 125.

Pad the targets file by 125 on each end

Merge the padded targets

Take off 25bp on each side to make a 100bp padded file

gawk -F'\t' '{ OFS = "\t" ; print $1,$2 - 125,$3 + 125 }' ${TARGETS} | \ sort -k1,1 -k2,2n | \ bedtools merge -i - | \ gawk -F'\t' '{ $2 = $2+25 ; $3 = $3-25 ; OFS = "\t" ; print $0 }' > ${OUT}/${OUT}_step1.bed

awchrist commented 7 years ago

Thanks @huslabgen I updated that step to set the start to 0 if the start of a padded target is equal to or less than 0.

I have not fully tested this with HG38 so there may be additional issues to work out. Please let me know if anything else comes up.

awchrist commented 7 years ago

@huslabgen I just noticed that the structure of the attribute column of the ensemble gtf has change in HG38. I will have to change how I parse the GTF before this will work.

huslabgen commented 7 years ago

Hi, Thanks for the fix. I believe that you may fix the GTF parsing issue by reading data from column 14 (i.e. transcript_id). However, applies only to CDS / transcript feature lines (while fails at e.g. gene feature lines lacking this information).

gawk -F'[\t"]' -v OUT=${OUT}/${OUT} 'FNR==NR && $3 == "transcript" { f[$14]=$10 } ; FNR==NR && $3 == "CDS" { t[$14]=$10 ; g[$10]=$10 ; next } ; { if($10 in a && $10 in g) {

From: Austin Christofferson notifications@github.com Reply-To: tgen/CovGen reply@reply.github.com Date: Friday, 15 September 2017 at 22.33 To: tgen/CovGen CovGen@noreply.github.com Cc: Kankainen Matti matti.kankainen@hus.fi, Mention mention@noreply.github.com Subject: Re: [tgen/CovGen] reading of BED fails at row 278 (#2)

@huslabgenhttps://github.com/huslabgen I just noticed that the structure of the attribute column of the ensemble gtf has change in HG38. I will have to change how I parse the GTF before this will work.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/tgen/CovGen/issues/2#issuecomment-329878905, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AeTF3I1iYiSxkJQ_LFv1aDXQOIAYql_Lks5sitETgaJpZM4PZSkq.

awchrist commented 7 years ago

Iv'e updated the gtf parsing of the attribute column. It can find "gene_id" and "transcript_id" no mater were they are in the column.

tgen / CovGen

reading of BED fails at row 278 #2

Pad the targets file by 125 on each end

Merge the padded targets

Take off 25bp on each side to make a 100bp padded file