twang15 / K562-Analysis

Prepare the motifs: automate the extraction of motif download #25

Open twang15 opened 2 years ago

twang15 commented 2 years ago

Problem statement:

We will have hundreds of models to explore, and their motif and PPM datasets on FactorBook have to be downloaded to SCG.

For example, this is the motif page for ELF1, which offers several downloadable files: https://www.factorbook.org/tf/human/ELF1/motif/ENCSR975SSR

However, we do not have links to the motif files for batch downloading, nor do we have links to the motif PPMs.

twang15 commented 2 years ago

Solution (Part 1, meme download)

  1. For motif PPM: Shannon found a way to download all of them at once. We can then parse the file to extract each one. Shannon's message:

Hey Tao,

So this link has the file with all motif probability matrices: https://www.factorbook.org/motif/human/download . You can search for the experiment ID (which is the same ID as the model) to find the motifs that were enriched in that experiment.

With wget:

wget https://screen-beta-api.wenglab.org/factorbook_downloads/complete-factorbook-catalog.meme.gz
gzip -d complete-factorbook-catalog.meme.gz
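
Searching the decompressed catalog for an experiment ID, as suggested above, could look roughly like the sketch below. It assumes every motif record starts with a line of the form "MOTIF <experiment>_<consensus>"; the function name list_motifs is made up for illustration.

```shell
# List the motif names recorded for one experiment in the decompressed catalog.
# Assumption: each motif begins with a line "MOTIF <experiment>_<consensus>".
list_motifs() {
    experiment_id="$1"
    catalog="$2"
    awk -v exp_id="$experiment_id" \
        '$1 == "MOTIF" && index($2, exp_id "_") == 1 { print $2 }' "$catalog"
}

# Example call (prints one motif name per line):
# list_motifs ENCSR975SSR complete-factorbook-catalog.meme
```

The prefix is compared with index() rather than a regular expression, so the ID is matched literally.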

Then split the records into separate files with AWK: https://www.gnu.org/software/gawk/manual/html_node/awk-split-records.html

https://stackoverflow.com/questions/14634349/calling-an-executable-program-using-awk

Option 1. split the meme file into many small meme files all at once:

awk '$1 == "MOTIF" {print $0 > $2; close($2)}' RS="" complete-factorbook-catalog.meme

Option 2. (https://stackoverflow.com/questions/39384283/how-to-match-a-pattern-given-in-a-variable-in-awk) extract the target meme file on the fly

awk -v target="ENCSR437GBJ_TGGACTTTGRACYYW" '$1 == "MOTIF" && $2 ~ target {print $0 > $2; close($2)}' RS="" complete-factorbook-catalog.meme
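
With hundreds of models, Option 2 could be extended to take a list of experiment IDs instead of a single target. The sketch below is one way to do that under stated assumptions: records are separated by blank lines, each begins with "MOTIF <experiment>_<consensus>", and the IDs file (one experiment ID per line) and the function name split_motifs_for_ids are hypothetical.

```shell
# Write one .meme file per motif for every experiment ID listed in ids_file.
# Assumptions: ids_file holds one experiment ID per line (hypothetical input);
# catalog records are blank-line separated and start with "MOTIF <exp>_<consensus>".
split_motifs_for_ids() {
    ids_file="$1"
    catalog="$2"
    out_dir="$3"
    mkdir -p "$out_dir"
    awk -v out_dir="$out_dir" '
        # First file (read line by line): remember the wanted experiment IDs.
        NR == FNR { if ($1 != "") wanted[$1] = 1; next }
        # Second file (paragraph mode): keep motifs whose name starts with a wanted ID.
        $1 == "MOTIF" {
            split($2, parts, "_")
            if (parts[1] in wanted) {
                path = out_dir "/" $2 ".meme"
                print $0 > path
                close(path)   # keep the number of open files small
            }
        }
    ' "$ids_file" RS="" "$catalog"
}

# Example call:
# split_motifs_for_ids experiment_ids.txt complete-factorbook-catalog.meme motifs
```

RS="" is given between the two file names so that only the catalog is read in paragraph mode.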

twang15 commented 2 years ago

Solution (part 2, motif download)

Use single-file (https://github.com/gildas-lormeau/SingleFile), which can save web pages to HTML from the command line interface. See here for more info: https://github.com/gildas-lormeau/SingleFile/blob/master/cli/README.MD.

On my mac:

# installation
npm install puppeteer@latest
sudo npm install -g "gildas-lormeau/SingleFile#master"

Troubleshooting: if the error message UnhandledPromiseRejectionWarning: Error: Browser is not downloaded. Run "npm install" or "yarn install" at ChromeLauncher.launch is displayed, single-file probably could not find the browser executable. Passing the complete path of the executable to single-file with the --browser-executable-path option fixes this issue.

Find chrome on my mac: (https://superuser.com/questions/772131/where-is-google-chrome-located-on-a-mac)
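
On machines other than this Mac the browser may live elsewhere, so a small helper that picks the first existing executable from a list of candidates can be handy. This is only a sketch: the function name first_executable is made up, and the Linux paths in the usage comment are guesses that would need checking on the target machine.

```shell
# Print the first argument that is an executable file; return non-zero if none is.
first_executable() {
    for candidate in "$@"; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Example call (candidate paths other than the Mac one are assumptions):
# chrome=$(first_executable \
#     "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
#     /usr/bin/google-chrome /usr/bin/chromium) || echo "no browser found" >&2
```

The result can then be passed to single-file as --browser-executable-path="$chrome".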

# download complete web page as html
single-file --browser-executable-path="/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" https://www.factorbook.org/tf/human/ELF1/motif/ENCSR975SSR ELF1.html    # this works

single-file --browser-executable-path="/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome" https://www.factorbook.org/tf/human/ELF1/motif/ENCSR975SSR ELF1.html  # this does not work: inside double quotes the backslashes stay literal, so the path does not exist

# extract the motif link
grep "hq-occurrences" ELF1.html | awk -F "=" '{for(i=1; i<=NF; i++) { if ($i ~ /hq-occurrences/) {split($i, a, "\""); print a[2]; } } }'

https://screen-beta-api.wenglab.org/factorbook_downloads/hq-occurrences/ENCFF133TSU_RCTTCCGG.gz
https://screen-beta-api.wenglab.org/factorbook_downloads/hq-occurrences/ENCFF133TSU_GRASCCGGAAGTGG.gz
https://screen-beta-api.wenglab.org/factorbook_downloads/hq-occurrences/ENCFF133TSU_TKRCGTCAYMRGNSSGCGCC.gz
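
To run the same steps for many transcription factors, the save, extract and download stages could be wrapped in a loop. The sketch below makes several assumptions: the page URL follows the pattern of the ELF1 example, the links appear in the saved HTML as double-quoted https URLs ending in .gz, and the input file targets.tsv (one TF name and experiment ID per line, tab separated) and all function names are hypothetical.

```shell
# Build the FactorBook motif page URL, following the pattern of the ELF1 example.
motif_page_url() {
    printf 'https://www.factorbook.org/tf/human/%s/motif/%s\n' "$1" "$2"
}

# Print every hq-occurrences download URL found in a saved page, one per line.
extract_hq_links() {
    grep -o 'https://[^"[:space:]]*hq-occurrences/[^"[:space:]]*\.gz' "$1" | sort -u
}

# Save each page with single-file, then download the files it links to.
# chrome: path to the browser executable; targets: TSV of "TF<TAB>experiment".
fetch_all() {
    chrome="$1"
    targets="$2"
    while IFS="$(printf '\t')" read -r tf experiment; do
        # </dev/null keeps single-file from consuming the rest of the TSV.
        single-file --browser-executable-path="$chrome" \
            "$(motif_page_url "$tf" "$experiment")" "$tf.html" </dev/null
        extract_hq_links "$tf.html" | wget --no-clobber --input-file=-
    done < "$targets"
}

# Example call:
# fetch_all "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" targets.tsv
```

Using grep -o avoids splitting on "=" and quote characters by hand, and sort -u drops links that occur more than once in a page.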

twang15 commented 2 years ago

Use httrack to download a complete HTML page

  1. httrack is an alternative to single-file: https://alternativeto.net/software/save-page-we/

    httrack --get https://www.encodeproject.org/experiments/ENCSR975SSR/ -O ELF1 -N ELF1.html