scverse / genomic-features

Genomic Features in Python from BioConductor's AnnotationHub
https://genomic-features.readthedocs.io
BSD 3-Clause "New" or "Revised" License
19 stars 5 forks source link

TSS/ promoter regions #4

Open ivirshup opened 1 year ago

ivirshup commented 1 year ago

Description of feature

Retrieve transcription start sites for genes for use with ATAC data.

This could be similar to the GenomicFeatures::promoters function (description available in this vignette) which:

The promoters function computes a GRanges object that spans the promoter region around the transcription start site for the transcripts in a TxDb object. The upstream and downstream arguments define the number of bases upstream and downstream from the transcription start site that make up the promoter region.

This could be done using bioframe.expand (though that is currently not strand aware: https://github.com/open2c/bioframe/issues/144). This could instead by done with ibis with: ifelse

emdann commented 1 year ago
emdann commented 1 year ago

Notes on filtering behaviour: in bioconductor the filtering is done at the level of transcripts, before the promoter sequences are defined.

Example: if promoter is within filtered range, but the transcript is not, then the promoter is not returned:

> transcripts(EnsDb.Hsapiens.v86, filter=GRangesFilter(GRanges('1:9000-12000'), type='any'))
GRanges object with 1 range and 6 metadata columns:
                  seqnames      ranges strand |           tx_id           tx_biotype tx_cds_seq_start tx_cds_seq_end         gene_id         tx_name
                     <Rle>   <IRanges>  <Rle> |     <character>          <character>        <integer>      <integer>     <character>     <character>
  ENST00000456328        1 11869-14409      + | ENST00000456328 processed_transcript             <NA>           <NA> ENSG00000223972 ENST00000456328
> promoters(EnsDb.Hsapiens.v86, filter=GRangesFilter(GRanges('1:9000-12000'), type='any'))
GRanges object with 1 range and 6 metadata columns:
                  seqnames     ranges strand |           tx_id           tx_biotype tx_cds_seq_start tx_cds_seq_end         gene_id         tx_name
                     <Rle>  <IRanges>  <Rle> |     <character>          <character>        <integer>      <integer>     <character>     <character>
  ENST00000456328        1 9869-12068      + | ENST00000456328 processed_transcript             <NA>           <NA> ENSG00000223972 ENST00000456328
  -------
  seqinfo: 1 sequence from GRCh38 genome

With within filtering:

> transcripts(EnsDb.Hsapiens.v86, filter=GRangesFilter(GRanges('1:9000-12000'), type='within'))
GRanges object with 0 ranges and 6 metadata columns:
   seqnames    ranges strand |       tx_id  tx_biotype tx_cds_seq_start tx_cds_seq_end     gene_id     tx_name
      <Rle> <IRanges>  <Rle> | <character> <character>        <integer>      <integer> <character> <character>
  -------
  seqinfo: no sequences
> promoters(EnsDb.Hsapiens.v86, filter=GRangesFilter(GRanges('1:9000-12000'), type='within'))
GRanges object with 0 ranges and 6 metadata columns:
   seqnames    ranges strand |       tx_id  tx_biotype tx_cds_seq_start tx_cds_seq_end     gene_id     tx_name
      <Rle> <IRanges>  <Rle> | <character> <character>        <integer>      <integer> <character> <character>
  -------
  seqinfo: no sequences