mozilla / active-data-recipes

A repository of various activedata queries and recipes
Mozilla Public License 2.0

Stop hardcoding mc_root_dirs.txt #33

Closed ahal closed 5 years ago

ahal commented 6 years ago

A few recipes require a list of all directories that live in mozilla-central. As a very quick hack, we hardcoded this in the mc_root_dirs.txt file at the root of the repo. This is obviously bad because this list will become out of date.

We should either write a new ActiveData query that provides this information, or else just query hg.mozilla.org for it (I think the latter might be simpler). Then we can be sure we have an up-to-date list of root mozilla-central dirs.

If we query hg.mozilla.org, we should put this function in a utility file. I propose adr/util/hgmo.py.

TrangNguyenBC commented 5 years ago

Hi @ahal, I will do this. I opened https://hg.mozilla.org/, then tried Mercurial > mozilla-central, but the list is different from the contents of mc_root_dirs.txt. How can I query hg.mozilla.org to get the list of all directories? Thank you.

TrangNguyenBC commented 5 years ago

Hi, I tried again and it seems that Mercurial > mozilla-central > files gives the list we need. Is that correct?

ahal commented 5 years ago

Yeah, I may have applied the 'help wanted' label a little too early here. I've been looking into ways of extracting directories out of mercurial but haven't been able to find anything. So maybe querying this information isn't possible.

I guess we could always scrape https://hg.mozilla.org/mozilla-central/file itself with beautifulsoup4 or something similar. If you like you can give scraping a shot, or else maybe we should pick a different issue.

TrangNguyenBC commented 5 years ago

Hi @ahal, I used beautifulsoup4 as in the code below (list_hidden controls whether hidden directories are included):

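import requests
from bs4 import BeautifulSoup


def get_directory_list(url, list_hidden=False):
    """Return the directory names listed on an hg.mozilla.org file page."""
    response = requests.get(url)
    response.raise_for_status()  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    result = []

    # On this page, links labelled "files" point at directories; the last
    # path component of the href is the directory name.
    for link in soup.find_all("a", href=True, string="files"):
        dir_name = link["href"].rstrip("/").split("/")[-1]
        # Skip hidden (dot-prefixed) directories unless list_hidden is set.
        if dir_name and (list_hidden or not dir_name.startswith(".")):
            result.append("{}/".format(dir_name))

    return result


if __name__ == "__main__":
    print(get_directory_list("https://hg.mozilla.org/mozilla-central/file"))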

The result is ['accessible/', 'browser/', 'build/', 'caps/', 'chrome/', 'config/', 'db/', 'devtools/', 'docshell/', 'dom/', 'editor/', 'embedding/', 'extensions/', 'gfx/', 'gradle/', 'hal/', 'image/', 'intl/', 'ipc/', 'js/', 'layout/', 'media/', 'memory/', 'mfbt/', 'mobile/', 'modules/', 'mozglue/', 'netwerk/', 'nsprpub/', 'other-licenses/', 'parser/', 'python/', 'security/', 'services/', 'servo/', 'startupcache/', 'storage/', 'taskcluster/', 'testing/', 'third_party/', 'toolkit/', 'tools/', 'uriloader/', 'view/', 'widget/', 'xpcom/', 'xpfe/']

What should I do next? Thank you.

ahal commented 5 years ago

Perfect!

Please put that logic into a utility file, maybe adr/util/hgmo.py (hgmo is short for hg.mozilla.org). Then we can delete mc_root_dirs.txt, and any recipes that need the list can call the util function instead.
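
For illustration, here is a rough sketch of how such a utility module might be laid out. It is not the code that landed in the repository. The function names (parse_root_dirs, get_root_dirs), the MC_FILE_URL constant and the helper class are all hypothetical, and it swaps beautifulsoup4 for the standard library's html.parser so that the parsing step has no third-party dependency. It assumes, as in the scraper above, that directory rows on the page carry a link whose text is "files".

```python
# Hypothetical sketch of adr/util/hgmo.py; every name here is illustrative.
from html.parser import HTMLParser
from urllib.request import urlopen

# URL taken from the discussion above.
MC_FILE_URL = "https://hg.mozilla.org/mozilla-central/file"


class _FilesLinkParser(HTMLParser):
    """Collect the href of every <a> whose visible text is 'files'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == "files":
                self.hrefs.append(self._href)
            self._href = None


def parse_root_dirs(html, list_hidden=False):
    """Extract directory names (with a trailing slash) from page HTML."""
    parser = _FilesLinkParser()
    parser.feed(html)
    parser.close()

    dirs = []
    for href in parser.hrefs:
        # The last path component of the link is the directory name.
        name = href.rstrip("/").split("/")[-1]
        if name and (list_hidden or not name.startswith(".")):
            dirs.append(name + "/")
    return dirs


def get_root_dirs(url=MC_FILE_URL, list_hidden=False):
    """Fetch the page at url and return the directories it lists."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_root_dirs(html, list_hidden)
```

Keeping the parsing separate from the network fetch means parse_root_dirs can be exercised against a saved HTML snippet, so a test does not need to reach hg.mozilla.org. A recipe would then call get_root_dirs() instead of reading mc_root_dirs.txt.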

TrangNguyenBC commented 5 years ago

Thanks @ahal. I can't find the query or recipe that uses mc_root_dirs.txt. How can I find it? Thank you.

ahal commented 5 years ago

Thanks for your contribution!

ahal commented 5 years ago

> Thank @ahal. I can't find the query or recipe which call mc_root_dirs.txt. How can I find it? Thank you.

Sorry, I missed this comment. Yes, I went looking for it too after you submitted your PR and it looks like nothing was using it. I found the original commit that added that file, and it looks like it was always unused: https://github.com/mozilla/active-data-recipes/commit/15ddbb483883aa4f94404dcfaac1a300266c8f19

Though I talked to Joel about this and he said that he was planning to make use of it but then never got around to it. I think it'll be useful for certain recipes that we create in the future, so let's leave your scraper where it is for now.

Sorry about that :(

TrangNguyenBC commented 5 years ago

Hi @ahal, it is OK. I only asked because I intended to modify the queries or recipes that use that file, so that nothing would break when we delete it. Solving this issue also helped me learn how to use beautifulsoup4. Thank you :)