mauranolab / mapping

2 stars 1 forks source link

Color genotypes bb lines #250

Closed cadley-nyulangone closed 3 years ago

cadley-nyulangone commented 3 years ago

Modify callsnpsMerge.sh and MakeTrackhub.py to color the individual tracks of the genotypes.bb file.

fixes #215

cadley-nyulangone commented 3 years ago

I need to append the 4 new columns to the existing 5 columns in genotypes.ucsc.bed (which eventually becomes a genotypes.bb file).

The data for the 4 new columns is derived from the contents of the genotypes.starch file

So whatever builds the new 9 column genotypes.bb file needs to get data from both the initial 5 column genotypes.bb file, plus data from the genotypes.starch file.

===================================================================================================================== From an earlier email:

The "genotypes.bb" files are built in a for loop in callsnpsMerge.sh (lines 181 - 201).

To get them to print in color, we need to add four columns to the genotypes.ucsc.bed file, which is produced just prior to the call to bedToBigBed in the loop.

The existing genotypes.ucsc.bed files are bed5 files, and have this form: Sox2_Sox2_46kb_dDistalLCR_marker 1709 1710 Sox2_Sox2_46kb_dDistalLCR_marker:1710_A 127 Sox2_Sox2_46kb_dDistalLCR_marker 4535 4536 Sox2_Sox2_46kb_dDistalLCR_marker:4536_T 127 Sox2_Sox2_46kb_dDistalLCR_marker 21600 21601 Sox2_Sox2_46kb_dDistalLCR_marker:21601_CCG 127

We need to append these 4 columns to each row:

A problem is to figure out what color code to use for each line of the genotypes.ucsc.bed file.

I think that can be solved by looking inside the genotypes.starch file, which is produced prior to the genotypes.bb file within the loop.

The genotypes.starch file has this format: Sox2_Sox2_46kb_dDistalLCR_marker 1709 1710 Sox2_Sox2_46kb_dDistalLCR_marker:1710_A 127 G A A 255,0 19 0 19 Sox2_Sox2_46kb_dDistalLCR_marker 4535 4536 Sox2_Sox2_46kb_dDistalLCR_marker:4536_T 127 A T T 219,0 991 86 905 Sox2_Sox2_46kb_dDistalLCR_marker 21600 21601 Sox2_Sox2_46kb_dDistalLCR_marker:21601_CCG 127 C CCG CCG 237,0 20 2 18

===========================================================================================================================================

On Thu, Jan 28, 2021 at 4:45 PM mattmaurano notifications@github.com wrote:

@mattmaurano requested changes on this pull request.

In dnase/callsnpsMerge.sh https://github.com/mauranolab/mapping/pull/250#discussion_r566426141:

  • cat $TMPDIR/variants.${sampleid}.txt | awk -v maxlen=115 -F "\t" 'BEGIN {OFS="\t"} function truncgt(x) {if(length(x)>maxlen) {return substr(x, 1, maxlen) "..." } else {return x} } $4=="." {chromnum=$1; gsub(/^chr/, "", chromnum); $4=chromnum ":" $2+1 } {split($8, gt, "/"); gtout=truncgt(gt[1]); if(length(gt)>2) {gtout=gtout "/" truncgt(gt[2])}; $4=$4 "_" gtout } {print}' | sort-bed - | cut -f1-5 | \
  • paste - <(awk 'BEGIN {OFS="\t"; FS="\t"} {chromStart=$2; chromEnd=$3; ref=$6; betterGT=$8; numAlleles=split(betterGT,alleles,"/")}
  • Assign colors to the genotypes.bb lines.

  • {
  • if (numAlleles>1 && alleles[1]!=alleles[2])
  • {
  • colors="19,165,220"; # het - blue
  • } else
  • {
  • if (alleles[1] == ref)
  • colors="105,105,105"; # hz_ref - grey
  • else
  • colors="253,199,0"; # hz_nonref - yellow
  • }
  • }
  • {print "+" , chromStart , chromEnd , colors}' <(unstarch "${sampleOutdir}/${name}${vcfsamplename}.${mappedgenome}.genotypes.starch") ) > $TMPDIR/${name}${vcfsamplename}.genotypes.ucsc.bed

shouldn't itemRgb be in the 9th column? See https://genome.ucsc.edu/FAQ/FAQformat.html#format1

In dnase/callsnpsMerge.sh https://github.com/mauranolab/mapping/pull/250#discussion_r566427709:

 #NB UCSC link from analysis.sh will be wrong for multisample calling

Start from .txt file to simplify logic even though we have to sort again

If there is no database ID (e.g. rsid), set up bed ID as chrom : pos _ alt; for database IDs, append alt allele

truncgt() is a fancy wrapper around substr for syntactic sugar and to add a "..."

  • cat $TMPDIR/variants.${sampleid}.txt | awk -v maxlen=115 -F "\t" 'BEGIN {OFS="\t"} function truncgt(x) {if(length(x)>maxlen) {return substr(x, 1, maxlen) "..." } else {return x} } $4=="." {chromnum=$1; gsub(/^chr/, "", chromnum); $4=chromnum ":" $2+1 } {split($8, gt, "/"); gtout=truncgt(gt[1]); if(length(gt)>2) {gtout=gtout "/" truncgt(gt[2])}; $4=$4 "_" gtout } {print}' | sort-bed - | cut -f1-5 > $TMPDIR/${name}${vcfsamplename}.genotypes.ucsc.bed
  • bedToBigBed -type=bed5 $TMPDIR/${name}${vcfsamplename}.genotypes.ucsc.bed ${chromsizes} ${sampleOutdir}/${name}${vcfsamplename}.${mappedgenome}.genotypes.bb
  • cat $TMPDIR/variants.${sampleid}.txt | awk -v maxlen=115 -F "\t" 'BEGIN {OFS="\t"} function truncgt(x) {if(length(x)>maxlen) {return substr(x, 1, maxlen) "..." } else {return x} } $4=="." {chromnum=$1; gsub(/^chr/, "", chromnum); $4=chromnum ":" $2+1 } {split($8, gt, "/"); gtout=truncgt(gt[1]); if(length(gt)>2) {gtout=gtout "/" truncgt(gt[2])}; $4=$4 "_" gtout } {print}' | sort-bed - | cut -f1-5 | \
  • paste - <(awk 'BEGIN {OFS="\t"; FS="\t"} {chromStart=$2; chromEnd=$3; ref=$6; betterGT=$8; numAlleles=split(betterGT,alleles,"/")}

Why do you need the starch file as input? The awk statement in the prior line outputs the same data.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mauranolab/mapping/pull/250#pullrequestreview-578764621, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKY4NJV5IZQDGGRMR4QUH23S4HLIPANCNFSM4WT4M6SQ .

cadley-nyulangone commented 3 years ago

I guess I could just use the starch file to make the ucsc.bed file, but that could have been done originally too. I didn't want to change that.

On Thu, Jan 28, 2021 at 5:00 PM John Cadley cadley.nyulangone@gmail.com wrote:

I need to append the 4 new columns to the existing 5 columns in genotypes.ucsc.bed (which eventually becomes a genotypes.bb file).

The data for the 4 new columns is derived from the contents of the genotypes.starch file

So whatever builds the new 9 column genotypes.bb file needs to get data from both the initial 5 column genotypes.bb file, plus data from the genotypes.starch file.

===================================================================================================================== From an earlier email:

The "genotypes.bb" files are built in a for loop in callsnpsMerge.sh (lines 181 - 201).

To get them to print in color, we need to add four columns to the genotypes.ucsc.bed file, which is produced just prior to the call to bedToBigBed in the loop.

The existing genotypes.ucsc.bed files are bed5 files, and have this form: Sox2_Sox2_46kb_dDistalLCR_marker 1709 1710 Sox2_Sox2_46kb_dDistalLCR_marker:1710_A 127 Sox2_Sox2_46kb_dDistalLCR_marker 4535 4536 Sox2_Sox2_46kb_dDistalLCR_marker:4536_T 127 Sox2_Sox2_46kb_dDistalLCR_marker 21600 21601 Sox2_Sox2_46kb_dDistalLCR_marker:21601_CCG 127

We need to append these 4 columns to each row:

  • strand thickStart thickEnd colorCode*

    • strand is "+" or "-". For this application, I think we can just use "+" everywhere.
    • thickStart and thickEnd can be equal, and can also be the same as the start position value. It affects the thickness of the colored line.
    • colorCode is formatted like this: 255,0,0 or 0,255,0 etc.

A problem is to figure out what color code to use for each line of the genotypes.ucsc.bed file.

I think that can be solved by looking inside the genotypes.starch file, which is produced prior to the genotypes.bb file within the loop.

The genotypes.starch file has this format: Sox2_Sox2_46kb_dDistalLCR_marker 1709 1710 Sox2_Sox2_46kb_dDistalLCR_marker:1710_A 127 G A A 255,0 19 0 19 Sox2_Sox2_46kb_dDistalLCR_marker 4535 4536 Sox2_Sox2_46kb_dDistalLCR_marker:4536_T 127 A T T 219,0 991 86 905 Sox2_Sox2_46kb_dDistalLCR_marker 21600 21601 Sox2_Sox2_46kb_dDistalLCR_marker:21601_CCG 127 C CCG CCG 237,0 20 2 18

===========================================================================================================================================

On Thu, Jan 28, 2021 at 4:45 PM mattmaurano notifications@github.com wrote:

@mattmaurano requested changes on this pull request.

In dnase/callsnpsMerge.sh https://github.com/mauranolab/mapping/pull/250#discussion_r566426141:

  • cat $TMPDIR/variants.${sampleid}.txt | awk -v maxlen=115 -F "\t" 'BEGIN {OFS="\t"} function truncgt(x) {if(length(x)>maxlen) {return substr(x, 1, maxlen) "..." } else {return x} } $4=="." {chromnum=$1; gsub(/^chr/, "", chromnum); $4=chromnum ":" $2+1 } {split($8, gt, "/"); gtout=truncgt(gt[1]); if(length(gt)>2) {gtout=gtout "/" truncgt(gt[2])}; $4=$4 "_" gtout } {print}' | sort-bed - | cut -f1-5 | \
  • paste - <(awk 'BEGIN {OFS="\t"; FS="\t"} {chromStart=$2; chromEnd=$3; ref=$6; betterGT=$8; numAlleles=split(betterGT,alleles,"/")}
  • Assign colors to the genotypes.bb lines.

  • {
  • if (numAlleles>1 && alleles[1]!=alleles[2])
  • {
  • colors="19,165,220"; # het - blue
  • } else
  • {
  • if (alleles[1] == ref)
  • colors="105,105,105"; # hz_ref - grey
  • else
  • colors="253,199,0"; # hz_nonref - yellow
  • }
  • }
  • {print "+" , chromStart , chromEnd , colors}' <(unstarch "${sampleOutdir}/${name}${vcfsamplename}.${mappedgenome}.genotypes.starch") ) > $TMPDIR/${name}${vcfsamplename}.genotypes.ucsc.bed

shouldn't itemRgb be in the 9th column? See https://genome.ucsc.edu/FAQ/FAQformat.html#format1

In dnase/callsnpsMerge.sh https://github.com/mauranolab/mapping/pull/250#discussion_r566427709:

 #NB UCSC link from analysis.sh will be wrong for multisample calling

Start from .txt file to simplify logic even though we have to sort again

If there is no database ID (e.g. rsid), set up bed ID as chrom : pos _ alt; for database IDs, append alt allele

truncgt() is a fancy wrapper around substr for syntactic sugar and to add a "..."

  • cat $TMPDIR/variants.${sampleid}.txt | awk -v maxlen=115 -F "\t" 'BEGIN {OFS="\t"} function truncgt(x) {if(length(x)>maxlen) {return substr(x, 1, maxlen) "..." } else {return x} } $4=="." {chromnum=$1; gsub(/^chr/, "", chromnum); $4=chromnum ":" $2+1 } {split($8, gt, "/"); gtout=truncgt(gt[1]); if(length(gt)>2) {gtout=gtout "/" truncgt(gt[2])}; $4=$4 "_" gtout } {print}' | sort-bed - | cut -f1-5 > $TMPDIR/${name}${vcfsamplename}.genotypes.ucsc.bed
  • bedToBigBed -type=bed5 $TMPDIR/${name}${vcfsamplename}.genotypes.ucsc.bed ${chromsizes} ${sampleOutdir}/${name}${vcfsamplename}.${mappedgenome}.genotypes.bb
  • cat $TMPDIR/variants.${sampleid}.txt | awk -v maxlen=115 -F "\t" 'BEGIN {OFS="\t"} function truncgt(x) {if(length(x)>maxlen) {return substr(x, 1, maxlen) "..." } else {return x} } $4=="." {chromnum=$1; gsub(/^chr/, "", chromnum); $4=chromnum ":" $2+1 } {split($8, gt, "/"); gtout=truncgt(gt[1]); if(length(gt)>2) {gtout=gtout "/" truncgt(gt[2])}; $4=$4 "_" gtout } {print}' | sort-bed - | cut -f1-5 | \
  • paste - <(awk 'BEGIN {OFS="\t"; FS="\t"} {chromStart=$2; chromEnd=$3; ref=$6; betterGT=$8; numAlleles=split(betterGT,alleles,"/")}

Why do you need the starch file as input? The awk statement in the prior line outputs the same data.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/mauranolab/mapping/pull/250#pullrequestreview-578764621, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKY4NJV5IZQDGGRMR4QUH23S4HLIPANCNFSM4WT4M6SQ .

mattmaurano commented 3 years ago

John, please look at the latest commit. Also take note of the style differences -- best to match the existing style. Can you please test this?

cadley-nyulangone commented 3 years ago

bedToBigBed wants +, -, or . in the strand field. It fails for a blank. I put in a . for the strand field.

bedToBigBed also doesn't like blank strand thickness fields. It generates an error message, although it also builds a bb file. However, the browser generates a "is not a big bed file" error for the bb file, and won't display a track. I filled in the strand thickness fields as they were before.

mattmaurano commented 3 years ago

Good catch!