Closed: ahal closed this issue 5 years ago
Hi @ahal, I will do this. I opened https://hg.mozilla.org/, then went to Mercurial > mozilla-central, but the list there is different from the content of mc_root_dirs.txt. How can I query hg.mozilla.org to get the list of all directories? Thank you.
Hi, I tried again and it seems that Mercurial > mozilla-central > files gives the list we need. Is that correct?
Yeah, I may have applied the 'help wanted' label a little too early here. I've been looking into ways of extracting directories out of mercurial but haven't been able to find anything. So maybe querying this information isn't possible.
I guess we could always scrape https://hg.mozilla.org/mozilla-central/file itself with beautifulsoup4 or something similar. If you like you can give scraping a shot, or else maybe we should pick a different issue.
Hi @ahal, I used beautifulsoup4 with the code below (list_hidden controls whether hidden directories are included):
from bs4 import BeautifulSoup
import requests


def get_directory_list(url, list_hidden=False):
    """Return the directory names linked from an hg.mozilla.org file listing."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    result = []
    # Each directory row in the listing carries a link whose text is "files".
    for link in soup.find_all("a", href=True, text="files"):
        # The last path component of the link target is the directory name.
        dir_name = link.get("href").split("/")[-1]
        if list_hidden or not dir_name.startswith("."):
            result.append("{}/".format(dir_name))
    return result


if __name__ == "__main__":
    print(get_directory_list("https://hg.mozilla.org/mozilla-central/file"))
The result is
['accessible/', 'browser/', 'build/', 'caps/', 'chrome/', 'config/', 'db/', 'devtools/', 'docshell/', 'dom/', 'editor/', 'embedding/', 'extensions/', 'gfx/', 'gradle/', 'hal/', 'image/', 'intl/', 'ipc/', 'js/', 'layout/', 'media/', 'memory/', 'mfbt/', 'mobile/', 'modules/', 'mozglue/', 'netwerk/', 'nsprpub/', 'other-licenses/', 'parser/', 'python/', 'security/', 'services/', 'servo/', 'startupcache/', 'storage/', 'taskcluster/', 'testing/', 'third_party/', 'toolkit/', 'tools/', 'uriloader/', 'view/', 'widget/', 'xpcom/', 'xpfe/']
What should I do as the next step? Thank you.
Perfect!
Please put that logic into a utility file, maybe adr/util/hgmo.py (which is short for hg.mozilla.org). Then we can delete that mc_root_dirs.txt and recipes that need it can call the util function instead.
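One way such a utility module could be laid out is sketched below. This is not the code that was merged. It swaps beautifulsoup4 for the standard library's html.parser so the sketch has no third-party dependencies, and it splits parsing from fetching so the parsing can be exercised without network access. The names HGMO_FILE_URL, _FilesLinkParser and parse_directory_list are hypothetical. It also assumes, as the scraper above does, that each directory row carries a link whose text is exactly "files".

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Listing page scraped in this thread.
HGMO_FILE_URL = "https://hg.mozilla.org/mozilla-central/file"


class _FilesLinkParser(HTMLParser):
    """Collect the href of every <a> element whose text is exactly "files"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self._href = None  # href of the <a> currently open, if it has one
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            if self._href is not None and "".join(self._text) == "files":
                self.hrefs.append(self._href)
            self._href = None


def parse_directory_list(html, list_hidden=False):
    """Extract directory names (with a trailing slash) from listing HTML."""
    parser = _FilesLinkParser()
    parser.feed(html)
    parser.close()
    dirs = []
    for href in parser.hrefs:
        # The last path component of the link target is the directory name.
        name = href.rstrip("/").split("/")[-1]
        if name and (list_hidden or not name.startswith(".")):
            dirs.append(name + "/")
    return dirs


def get_directory_list(url=HGMO_FILE_URL, list_hidden=False):
    """Fetch the listing page and return its directory names."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return parse_directory_list(html, list_hidden)
```

A recipe would then call get_directory_list() instead of reading mc_root_dirs.txt, while tests can feed a saved HTML snippet to parse_directory_list().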
Thanks @ahal. I can't find any query or recipe that uses mc_root_dirs.txt. How can I find it? Thank you.
Thanks for your contribution!
Sorry, I missed this comment. Yes, I went looking for it too after you submitted your PR and it looks like nothing was using it. I found the original commit that added that file, and it looks like it was always unused: https://github.com/mozilla/active-data-recipes/commit/15ddbb483883aa4f94404dcfaac1a300266c8f19
Though I talked to Joel about this and he said that he was planning to make use of it but then never got around to it. I think it'll be useful for certain recipes that we create in the future, so let's leave your scraper where it is for now.
Sorry about that :(
Hi @ahal. That's OK. I only asked because I intended to modify the queries or recipes that use that file, so that nothing would break when we delete it. Solving this issue also helped me learn how to use beautifulsoup4. Thank you :)
A few recipes require a list of all directories that live in mozilla-central. As a very quick hack, we hardcoded this in the mc_root_dirs.txt file at the root of the repo. This is obviously bad because this list will become out of date. We should either write a new ActiveData query that provides this information, or else just query hg.mozilla.org for it (I think the latter might be simpler). Then we can be sure we have an up-to-date list of root mozilla-central dirs.
If we query hg.mozilla.org, we should put this function in a utility file. I propose adr/util/hgmo.py.
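As a possible alternative to scraping HTML, here is a minimal sketch of querying hg.mozilla.org for structured data. It rests on an assumption this thread does not confirm: that the server exposes Mercurial hgweb's JSON template at a json-file URL, and that the response holds a "directories" list whose entries have a "basename" key. The URL constant and both function names are hypothetical. Check the real response shape before relying on any of this.

```python
import json
from urllib.request import urlopen

# ASSUMPTION: hgweb's JSON template is enabled and serves the root manifest here.
HGMO_JSON_URL = "https://hg.mozilla.org/mozilla-central/json-file/tip/"


def directories_from_manifest(manifest, list_hidden=False):
    """Return root directory names (with a trailing slash) from a decoded manifest.

    ASSUMPTION: `manifest` looks like {"directories": [{"basename": "dom"}, ...]}.
    """
    names = []
    for entry in manifest.get("directories", []):
        name = entry.get("basename", "")
        if name and (list_hidden or not name.startswith(".")):
            names.append(name + "/")
    return names


def fetch_root_directories(url=HGMO_JSON_URL, list_hidden=False):
    """Download the manifest JSON and extract the directory names."""
    with urlopen(url) as response:
        manifest = json.load(response)
    return directories_from_manifest(manifest, list_hidden)
```

Keeping the extraction in directories_from_manifest() means it can be checked against a hand-written dictionary, and only fetch_root_directories() touches the network.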