loosolab / UROPA

Universal RObust Peak Annotator
https://uropa-manual.readthedocs.io/
MIT License
15 stars 6 forks source link

Internals: center does not work #4

Closed mcsimenc closed 3 years ago

mcsimenc commented 3 years ago

I would like peaks located within genes to take priority over the distance setting, which I think is what internals:any and internals:center are for. For example, I set --distance to 2000,500 and internals to 1 (because command line arguments of "center" or "any" to --internals are not accepted nor is a value of center or any accepted for the internals key in my config), but I find that when the peak lies 1kb upstream of gene2 in an intron of gene1, the peak is still assigned to gene2.

The documentation here: https://uropa-manual.readthedocs.io/config.html explains the options of center and any for internals. It states that a value of True for internals is equivalent to internals:center. However in my uropa output (using --internals 1) the distance parameter still has precedence, which is clear when I look at my data in a genome browser. Am I missing something?

jenzopr commented 3 years ago

Hi Matt,

thanks for reaching out! Indeed, the documentation is not up-to-date regarding the internals settings, as you mentioned. I pushed a fix in a8f1b5b25c0d15af6195adb3be57a2bb9e6c527e. We will also update the relevant Figures.

Regardless of the documentation, we think UROPA's output is correct in your example. UROPA will always select the best hit according to distance. In your case, the peak lies well within 2kB upstream of the start of gene2 (1kB, as you mentioned). I therefore assume that the start of gene1 is farther away. If gene2 would not exists at all, gene1 would be reported instead as best hit, although it does not meet the distance thresholds.

Let me know if this helps. Best, Jens

mcsimenc commented 3 years ago

Hi Jens,

I've looked at several genes overlayed with my coverage tracks in a genome browser and there are definitely peaks within gene bodies where the peak is assigned to a different, downstream gene. I'll double check and get some screenshots/UROPA outputs to share when I get back home on Monday.

Thanks for the prompt response!

Matt

mcsimenc commented 3 years ago

Edit: I read your reply again and I see you said UROPA will always select the best hit according to distance. What I would like is to give priority to features containing peaks even if the start of that feature is further away from the peak than the start of a downstream feature. Is this possible?

Here are three peak in the UROPA *finalHits.bed output which are assigned to downstream genes despite being centered in upstream exons or introns.

5   27829463    27829765    5-28    74.000000   +   transcript  27828570    27829290    -   start   324 Downstream  0.0 0.0 Qs_050575-T1    query_1
12  18154030    18154332    12-36   65.000000   +   transcript  18155076    18156630    +   start   895 Upstream    0.0 0.0 Qs_027769-T1    query_1
12  26599128    26599430    12-52   59.000000   +   transcript  26593796    26598720    -   start   559 Downstream  0.0 0.0 Qs_028923-T1    query_1

Here is a genome browser screenshot showing the peak location relative to nearby genes:

misannotated

Here is the UROPA call I used (v3.1.0, I haven't been able to successfully install v3.5.3 via conda yet)

uropa -b 2020-10-08_Q.suber_DAPseq_AtMYB41_rep1rep3_noDuplicates.F4.5L4.5.bed \
      -g ../Quercus_suber_Qsuber_eudicot.gtf \
      --feature transcript \
      --feature_anchor start \
      --distance 2000 500 \
      --internals 1 \
      --show_attributes transcript_id \
      --strand ignore \
      -t 24

Thanks!

msbentsen commented 3 years ago

Hi, yes exactly - internals just allows for a peak to be annotated to a gene although the distance is further than given in --distance. However, a nearby gene with lower distance can take precedence over peaks internally in genes. There are different ways to give priority to peaks within transcripts:

Let us know if this works out for you!

mcsimenc commented 3 years ago

This is great. I tried to use this config json but got an error. Do you know what is wrong? Here is the config and output:

2020-10-23_uropa_peak_gene_assignment_config.json

{
"queries":[
    {"name": "peaks_in_transcripts", "feature": "transcript", "distance": [1,1], "internals": "True"},
    {"name": "peaks_distance", "feature": "transcript", "distance": [2000,500], "feature_anchor": "start", "internals": "True"}
          ],
"show_attributes": "transcript_id",
"priority": "True",
"gtf": "/home/msimenc/data/Athaliana/araport11/annotation/Athaliana_447_Araport11.gene_exons.gtf",
"bed": "A.thaliana_AtMYB41_noDuplicates.F2.5L2.5.rep2rep3_consensus_peaks.bed",
"outdir": "uropa"
}
$ uropa -i 2020-10-23_uropa_peak_gene_assignment_config.json 
2020-10-23 09:42:55 [INFO] - Started UROPA
2020-10-23 09:42:55 [INFO] - Working directory: /home/msimenc/analysis/arabidopsis/dapseq/myb41/replicates
2020-10-23 09:42:55 [INFO] - Command-line call: /home/msimenc/software/anaconda3/bin/uropa -i 2020-10-23_uropa_peak_gene_assignment_config.json
2020-10-23 09:42:55 [INFO] - Reading configuration from commandline/input config
2020-10-23 09:42:55 [INFO] - Reading .bed-file to annotate
2020-10-23 09:42:55 [INFO] - Preparing .gtf-file for fast access
2020-10-23 09:42:56 [INFO] - Started annotation
2020-10-23 09:42:56 [INFO] - Progress: 0%
2020-10-23 09:42:56 [INFO] - Progress: 1%
2020-10-23 09:42:56 [INFO] - Progress: 2%
2020-10-23 09:42:56 [INFO] - Progress: 3%
2020-10-23 09:42:56 [INFO] - Progress: 4%
2020-10-23 09:42:56 [INFO] - Progress: 5%
2020-10-23 09:42:56 [INFO] - Progress: 6%
2020-10-23 09:42:56 [INFO] - Progress: 7%
2020-10-23 09:42:56 [INFO] - Progress: 8%
2020-10-23 09:42:56 [INFO] - Progress: 9%
2020-10-23 09:42:56 [INFO] - Progress: 11%
2020-10-23 09:42:56 [INFO] - Progress: 12%
2020-10-23 09:42:56 [INFO] - Progress: 13%
2020-10-23 09:42:56 [INFO] - Progress: 14%
2020-10-23 09:42:56 [INFO] - Progress: 15%
2020-10-23 09:42:56 [INFO] - Progress: 16%
2020-10-23 09:42:56 [INFO] - Progress: 17%
2020-10-23 09:42:56 [INFO] - Progress: 18%
2020-10-23 09:42:56 [INFO] - Progress: 19%
2020-10-23 09:42:56 [INFO] - Progress: 20%
2020-10-23 09:42:56 [INFO] - Progress: 21%
2020-10-23 09:42:56 [INFO] - Progress: 22%
2020-10-23 09:42:56 [INFO] - Progress: 23%
2020-10-23 09:42:56 [INFO] - Progress: 24%
2020-10-23 09:42:56 [INFO] - Progress: 25%
2020-10-23 09:42:56 [INFO] - Progress: 26%
2020-10-23 09:42:56 [INFO] - Progress: 27%
2020-10-23 09:42:56 [INFO] - Progress: 28%
2020-10-23 09:42:56 [INFO] - Progress: 29%
2020-10-23 09:42:56 [INFO] - Progress: 31%
2020-10-23 09:42:56 [INFO] - Progress: 32%
2020-10-23 09:42:56 [INFO] - Progress: 33%
2020-10-23 09:42:56 [INFO] - Progress: 34%
2020-10-23 09:42:56 [INFO] - Progress: 35%
2020-10-23 09:42:56 [INFO] - Progress: 36%
2020-10-23 09:42:56 [INFO] - Progress: 37%
2020-10-23 09:42:56 [INFO] - Progress: 38%
2020-10-23 09:42:56 [INFO] - Progress: 39%
2020-10-23 09:42:56 [INFO] - Progress: 40%
2020-10-23 09:42:56 [INFO] - Progress: 41%
2020-10-23 09:42:56 [INFO] - Progress: 42%
2020-10-23 09:42:56 [INFO] - Progress: 43%
2020-10-23 09:42:56 [INFO] - Progress: 44%
2020-10-23 09:42:56 [INFO] - Progress: 45%
2020-10-23 09:42:56 [INFO] - Progress: 46%
2020-10-23 09:42:56 [INFO] - Progress: 47%
2020-10-23 09:42:56 [INFO] - Progress: 48%
2020-10-23 09:42:56 [INFO] - Progress: 49%
2020-10-23 09:42:56 [INFO] - Progress: 51%
2020-10-23 09:42:56 [INFO] - Progress: 52%
2020-10-23 09:42:56 [INFO] - Progress: 53%
2020-10-23 09:42:56 [INFO] - Progress: 54%
2020-10-23 09:42:56 [INFO] - Progress: 55%
2020-10-23 09:42:56 [INFO] - Progress: 56%
2020-10-23 09:42:56 [INFO] - Progress: 57%
2020-10-23 09:42:56 [INFO] - Progress: 58%
2020-10-23 09:42:56 [INFO] - Progress: 59%
2020-10-23 09:42:56 [INFO] - Progress: 60%
2020-10-23 09:42:56 [INFO] - Progress: 61%
2020-10-23 09:42:56 [INFO] - Progress: 62%
2020-10-23 09:42:56 [INFO] - Progress: 63%
2020-10-23 09:42:56 [INFO] - Progress: 64%
2020-10-23 09:42:56 [INFO] - Progress: 65%
2020-10-23 09:42:56 [INFO] - Progress: 66%
2020-10-23 09:42:56 [INFO] - Progress: 67%
2020-10-23 09:42:56 [INFO] - Progress: 68%
2020-10-23 09:42:56 [INFO] - Progress: 69%
2020-10-23 09:42:56 [INFO] - Progress: 71%
2020-10-23 09:42:56 [INFO] - Progress: 72%
2020-10-23 09:42:56 [INFO] - Progress: 73%
2020-10-23 09:42:56 [INFO] - Progress: 74%
2020-10-23 09:42:56 [INFO] - Progress: 75%
2020-10-23 09:42:56 [INFO] - Progress: 76%
2020-10-23 09:42:56 [INFO] - Progress: 77%
2020-10-23 09:42:56 [INFO] - Progress: 78%
2020-10-23 09:42:56 [INFO] - Progress: 79%
2020-10-23 09:42:56 [INFO] - Progress: 80%
2020-10-23 09:42:56 [INFO] - Progress: 81%
2020-10-23 09:42:56 [INFO] - Progress: 82%
2020-10-23 09:42:56 [INFO] - Progress: 83%
2020-10-23 09:42:56 [INFO] - Progress: 84%
2020-10-23 09:42:56 [INFO] - Progress: 85%
2020-10-23 09:42:56 [INFO] - Progress: 86%
2020-10-23 09:42:56 [INFO] - Progress: 87%
2020-10-23 09:42:56 [INFO] - Progress: 88%
2020-10-23 09:42:56 [INFO] - Progress: 89%
2020-10-23 09:42:56 [INFO] - Progress: 91%
2020-10-23 09:42:56 [INFO] - Progress: 92%
2020-10-23 09:42:56 [INFO] - Progress: 93%
2020-10-23 09:42:56 [INFO] - Progress: 94%
2020-10-23 09:42:56 [INFO] - Progress: 95%
2020-10-23 09:42:56 [INFO] - Progress: 96%
2020-10-23 09:42:56 [INFO] - Progress: 97%
2020-10-23 09:42:56 [INFO] - Progress: 98%
2020-10-23 09:42:56 [INFO] - Progress: 99%
2020-10-23 09:42:56 [INFO] - Progress: 100%
2020-10-23 09:42:56 [INFO] - Processing annotated peaks
Traceback (most recent call last):
  File "/home/msimenc/software/anaconda3/bin/uropa", line 11, in <module>
    load_entry_point('uropa==3.1.0', 'console_scripts', 'uropa')()
  File "/home/msimenc/software/anaconda3/lib/python3.7/site-packages/uropa/uropa.py", line 413, in main
    header_output = main + attribute_columns + ["name"]
TypeError: can only concatenate list (not "str") to list
jenzopr commented 3 years ago

I think it should be

{
...
"show_attributes": ["transcript_id"],
...
}

instead. We should write a check for this @msbentsen .

mcsimenc commented 3 years ago

I deleted my previous post saying the JSON input didn't work. It was caused by having different chromosome names in my GTF and BED input.

It seems to work fine. Thanks very much for your help!

Best,

Matt

jenzopr commented 3 years ago

Great you figured it out and thanks for using UROPA :-)

msbentsen commented 3 years ago

I think it should be

{
...
"show_attributes": ["transcript_id"],
...
}

instead. We should write a check for this @msbentsen .

@jenzopr There is already a check for this included since uropa 3.4.0 (ref here). @mcsimenc is running 3.1.0, so that is the reason for the error. There were a lot of bugfixes since then, so I recommend to update to current version 3.5.3 :-)