Builtwith API Project

ryanmswan commented 2 years ago

Overview

Project: Open Community Survey

Volunteer Opportunity: Create a scraper to get information from builtwith.com on the technologies used by neighborhood council websites. Organize the data (create categories for the tech) and automate the scrape job to run periodically. Additionally, we want to display this information with a dashboard (see the Google Data Studio Dashboard linked below under "Project Output" for an example).

Contact: ~Ryan Swan (data science), Kaylani (open community survey)~ Bonnie

Action Items

[x] Create a wiki page
[x] Build a scraper that we can reuse to get the data on the NC site technologies
[x] Add the scripts and other code to the data science repo or if another repo is required, let the leads know.
[x] Create a Spreadsheet from the results of initial scrape
[x] ~Create a set of categories in the spreadsheet~
[x] Rework the script to grab category as well as technology (it's available in the API)
[x] Add category to each technology so that the data can be grouped and analyzed (this will happen automatically once the prior item is done)
[ ] Assess code for current scraper to determine if it still functions properly
[ ] Perform additional analysis on Widget technology category: Which sites are using calendars? What are the calendars used for (events of the NC or local events)? Which sites use chatbots? Which sites have search functionality? How many sites use translation widgets?
[ ] Finish analysis of the following technology categories: Content Management System (CMS), Mobile, SSL, Payment, Framework, and Copyright
[ ] Fix directory issues with code. Currently, it's in the 311 directory but needs to be moved to the open community survey directory.
[ ] ~Create a reusable matching table of technology to category~
[ ] ~Create a script to be able to create a new spreadsheet with the matching table so that the technologies are already categorized (except of course the ones that are new).~
[ ] ~create instructions for updating matching table and running scripts.~
[ ] make sure wiki is updated.
[ ] release dependency on - https://github.com/hackforla/open-community-survey/issues/28

Resources/Instructions

External Tools

Builtwith
- https://builtwith.com
- builtwith API
- API limitations: Some sites are resistant to being crawled (WordPress, for instance: https://atwatervillage.org/calendar/). So what we need is a list of all the sites that can't be put through the sitemap maker. See notes about WordPress site crawling: https://community.funnelback.com/knowledge-base/implementation/Gather-And-Index/integration/crawl-wordpress-sites
Selenium
- Selenium Scraping Tools - see branch
Docker

Tutorial

Selenium/Docker, made by Sophia

Project input (data)

Target Website List Here - this is one tab on a larger analysis workbook.

Past Collaborators: @akibrhast, @ava li, @Sarah Williams, @wendywilhelm10 @rajindermavi @ShikaZzz @JessicaFB @Poorvi Rao

ryanmswan commented 2 years ago

Target Website List Here

ExperimentsInHonesty commented 2 years ago

@ryanswan @salice @JessicaFB @poorvi4 @akibrhast, @ava li, @sarah Williams, @wendywilhelm10 I forgot to mention that @mattyweb has a report tool that he has set up on our AWS, and it might be a good place to dump all this data (from the Comparative analysis of features and then the technologies from the builtwith API). Then we can define the types of reports we want to display, and it would allow people to look at a single NC as well as aggregate stats, etc. Think of it as a real-time data visualizer for end users once we have figured out what is worth looking at. At least that's my understanding of how it works. It would be good to ask him to come to the Data Science Community of Practice at some point to discuss it.

ExperimentsInHonesty commented 2 years ago

@salice Please provide update

Progress
Blockers
Availability
ETA

ebele-oputa commented 2 years ago

@ryanmswan Please provide update

Progress
Blockers
Availability
ETA

chelseybeck commented 2 years ago

@rajindermavi here is the drive with the tutorials. They're large files so they might be still uploading for a few minutes.

rajindermavi commented 2 years ago

> @rajindermavi here is the drive with the tutorials. They're large files so they might be still uploading for a few minutes.

@chelseybeck The folder is empty.

ebele-oputa commented 2 years ago

@rajindermavi Please provide update:

Progress
Blockers
Availability
ETA

rajindermavi commented 2 years ago

@ebele-oputa

I got the Docker/Selenium version working. I can scrape builtwith and output the data to a JSON file. I am now going to try to get all the websites from an online source using Selenium. I'll be at the data science meeting tonight to discuss.

rajindermavi commented 2 years ago

@ebele-oputa

I made a pull request with the webscraping for all websites. It includes a dockerfile and script that produces a json file, which I included.

ebele-oputa commented 2 years ago

Thanks @rajindermavi for the updates and the work done! Looking forward to receiving a readable file containing the list in a Google sheet.

rajindermavi commented 2 years ago

@ebele-oputa

Thanks Ebele! The Google Sheet output is here. Let me know if there is anything else.

ShikaZzz commented 2 years ago

@rajindermavi If my understanding is correct, the current issue is that we need to extract the link of each webpage in an NC website and run each one through builtwith, because builtwith can only analyze the technology used for a specific link rather than for the entire website.
To extract the link of each page, we could use an online tool, but the running time is very long and it also extracts PDF file links. Instead, we could use a Python package or /sitemap.xml, but neither of these two methods returns results if a website does not have a sitemap.

What would be the ETA for the next step?
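
For reference, the /sitemap.xml approach mentioned above could look roughly like the sketch below. It is not the code in the repo: the function names `extract_page_urls` and `fetch_sitemap` are made up for illustration, and it assumes the site serves a standard sitemap at `/sitemap.xml`. Sitemap index files that point to child sitemaps are not followed.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import List, Optional
from urllib.parse import urljoin

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_page_urls(sitemap_xml: bytes) -> List[str]:
    """Return every <loc> URL in a sitemap, skipping links to PDF files."""
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
    return [u for u in urls if not u.lower().endswith(".pdf")]


def fetch_sitemap(site_url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Download <site>/sitemap.xml; return None if the request fails."""
    sitemap_url = urljoin(site_url, "/sitemap.xml")
    try:
        with urllib.request.urlopen(sitemap_url, timeout=timeout) as resp:
            return resp.read()
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return None
```

A site where `fetch_sitemap` returns `None` would go on the list of sites that need a different approach.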

ExperimentsInHonesty commented 2 years ago

The issue is not that the site does not have a sitemap; it's that some sites are resistant to being crawled (WordPress, for instance: https://atwatervillage.org/calendar/). So what we need is a list of all the sites that can't be put through the sitemap maker.

See notes about wordpress site crawling: https://community.funnelback.com/knowledge-base/implementation/Gather-And-Index/integration/crawl-wordpress-sites

@rajindermavi can you provide a progress report on this issue?

ShikaZzz commented 2 years ago

Rajinder's personal webscraping repo - seems to be ahead of the repo https://github.com/hackforla/data-science/tree/main/311-data/webscraping

ShikaZzz commented 2 years ago

Finished scraping and now working on extracting data from the JSON file and on data analysis. Plans: extract the data, possibly arrange it into a unified dataframe, then analyze it. ETA for the data analysis is about a week.

ExperimentsInHonesty commented 2 years ago

Abe and Bonnie will clean this issue up. Objective is that this spreadsheet OCS: Builtwith data on 99 NCs technologies will have the data on the NCs that is needed for understanding what they use tech for.

ExperimentsInHonesty commented 2 years ago

At the top of this issue there was a link for [OCS - NC: Competitive or Comparative Analysis Template] but it went to https://www.sciencedirect.com/science/article/abs/pii/S0161642016307321, which is clearly a mistake, so we removed it.

ExperimentsInHonesty commented 2 years ago

@akhaleghi - we finished reviewing this issue

Resources we have questions about

Why are we linking to a specific branch in our repo? Selenium Scraping Tools - see branch. It's not necessarily a problem, but it would be good to know why and document it here.

Rajinder's code seems to be in two places. Please sort out

code on data-science repo with Rajinder's code - this will need to be moved to another directory. It has nothing to do with 311. It's a project for Open Community Survey
Rajinder's personal repo - this seems to be updated more recently than the one on data-science.

Review Action Items above

Please review the action items at the top of this issue, with Ryan and Sophia so that they can identify any new steps we need to add to the issue either because they are missing, or now we need to do something to get it back on track (e.g., sorting out the difference between old and new code from Rajinder).

Missing notes

Also, I remember Rajinder saying something about the API timing out, or having a limit on how many API calls you could make from one IP. So it's possible that we will need to build throttling or IP hopping into the script if there is none, or ask them for a non-profit license to use their API in exchange for credit/logo placement in our final published public report. But it would be good to get Rajinder to document what he was experiencing, so we don't have to recreate the issue.
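
If throttling does turn out to be needed, a minimal sketch might look like the following. The actual rate limit is not documented in this issue, so `min_interval` and `max_retries` are placeholder values, and `throttled_map` and the `fetch` callable are hypothetical names standing in for whatever function the script uses to query builtwith.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def throttled_map(
    fetch: Callable[[T], R],
    items: Iterable[T],
    min_interval: float = 2.0,   # assumed pause between calls, in seconds
    max_retries: int = 3,        # assumed retry budget per item
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Call fetch(item) for each item, pausing between calls.

    A failed call is retried with a doubling delay; after max_retries
    failed retries the last exception is re-raised.
    """
    results: List[R] = []
    for item in items:
        delay = min_interval
        for attempt in range(max_retries + 1):
            try:
                results.append(fetch(item))
                break
            except Exception:  # broad on purpose: the failure mode is unknown
                if attempt == max_retries:
                    raise
                sleep(delay)
                delay *= 2
        sleep(min_interval)  # fixed pause before the next item
    return results
```

The `sleep` parameter exists only so the pauses can be replaced in tests; in normal use the default `time.sleep` applies.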

akhaleghi commented 2 years ago

Update: Messaged Rajinder to get him to update the files in the data science repository.

willa-mannering commented 2 years ago

Updated the files in the data science repo. The scraper now produces a table including tech categories, tech urls, and total usage count of each tech (total_count). Linked new output as google sheet in readme.
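
As a rough illustration of the table described above (not the script in the repo), the aggregation could be sketched as follows. The record layout (`technologies`, `name`, `category`, `url`) and the function name `build_tech_table` are assumptions made for the example, and `total_count` is assumed to mean the number of scraped sites on which a tech was found.

```python
from collections import Counter
from typing import Dict, List


def build_tech_table(site_records: List[dict]) -> List[Dict[str, object]]:
    """Aggregate per-site scrape results into one row per technology."""
    counts: Counter = Counter()
    info: Dict[str, tuple] = {}
    for record in site_records:
        seen = set()  # count each tech at most once per site
        for tech in record["technologies"]:
            name = tech["name"]
            if name in seen:
                continue
            seen.add(name)
            counts[name] += 1
            info.setdefault(name, (tech["category"], tech["url"]))
    # most_common() orders rows from the most to the least used tech
    return [
        {"tech": name, "category": info[name][0], "url": info[name][1], "total_count": n}
        for name, n in counts.most_common()
    ]
```

Each returned dict maps onto one spreadsheet row, so the list can be written out with `csv.DictWriter` or loaded into a dataframe.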

willa-mannering commented 2 years ago

Added new spreadsheet to OCS folder and updated project wiki page

akhaleghi commented 2 years ago

Hi @willa-mannering are there any updates to the issue for this week?

willa-mannering commented 2 years ago

No new updates for this issue, it should be finished now.

ExperimentsInHonesty commented 2 years ago

@willa-mannering We just looked at the results, and it looks like we will need to dive deeper into the results that come back for framework. For instance, when clicking on

Organization Schema | https://trends.builtwith.com/framework/Organization-Schema | framework

the section with a red outline tells us what we need to know about the framework. In this instance, it's schema.org.

schema.org

![Screenshot 2022-06-11 084221](https://user-images.githubusercontent.com/37763229/173194859-dc86eefc-cc3d-4771-a82c-cfd414101607.png)

In the next example, it's WordPress:

Elegant Themes | https://trends.builtwith.com/framework/Elegant-Themes | framework

wordpress.org

![image](https://user-images.githubusercontent.com/37763229/173195087-80d51f98-c0d9-4698-8437-d10a98d4918f.png)

willa-mannering commented 2 years ago

Potential additional information I can collect from each tech type includes: subcategories (i.e. WordPress Theme), tech description, tech website link, number of sites currently using tech, and competing/similar techs.

ambersu123 commented 2 years ago

Pull all the additional info available with the script and then decide what information is needed in the future meeting

akhaleghi commented 1 year ago

@willa-mannering @ambersu123 are there updates on what additional information needs to be pulled with this script?

willa-mannering commented 1 year ago

No decision on what additional info to pull. I've written a script to pull all options mentioned in my previous comment and am now waiting on input from the OCS team.

ExperimentsInHonesty commented 1 year ago

@willa-mannering you said

> Potential additional information I can collect from each tech type includes: subcategories (i.e. WordPress Theme), tech description, tech website link, number of sites currently using tech, and competing/similar techs.

OCS team said this in response

> Pull all the additional info available with the script and then decide what information is needed in the future meeting

So to be clear, we are saying yes, please pull all the information you said you could pull.

@akhaleghi please add this as a recurring reporting item to our DS/Org agenda

ExperimentsInHonesty commented 1 year ago

@willa-mannering It looks like this was discussed at a meeting but never noted on this issue: we only need the above subcategories for the items marked TRUE in the OCS: Builtwith tech_table, tech_categories

The different columns are for our own reference and have no significance for you. Just grab more info for any of the columns marked TRUE.

willa-mannering commented 1 year ago

Added new sheet to OCS table (more_info_tech) including the additional information for the designated categories via the tech_categories sheet.

willa-mannering commented 1 year ago

Need to pull additional data to determine which NC websites are using WordPress, then think of good questions to investigate further based on the additional data collected (e.g., how do NC websites use certain technologies?)

willa-mannering commented 1 year ago

Added new table ranking technologies by number of live sites to OCS Google sheet. Still working on pulling data to figure out which techs use WordPress.

willa-mannering commented 1 year ago

Added new table to OCS google sheet with info on which NC sites use Wordpress. Some NC sites were no longer accessible (for example, svanc.org)

kalyaniraman commented 1 year ago

Need to figure out how to make this information actionable for the NCs.

They want to know where they stand against other NCs
They want to see where they are against normal websites
Stakeholders want to know what the health of the network is

kalyaniraman commented 1 year ago

@willa-mannering Here is the OCS Google Template for the presentation -

willa-mannering commented 1 year ago

Updated OCS technology analysis presentation to use correct slide template. Finished Analytics portion, began working on Widgets analysis.

akhaleghi commented 1 year ago

Hey @willa-mannering are there any recent updates to this issue?

willa-mannering commented 1 year ago

Currently still working on the technology analysis presentation (which has been added as a link to this issue). I'm editing the analytics and widgets sections according to feedback. After that I will start analyzing the content management system section.

ExperimentsInHonesty commented 1 year ago

Spreadsheet of updated script results

Categories in the sheet we decided we wanted to break down further:
- Content Management system (cms)
- Mobile
- SSL
- Payment
- Framework
- Copyright

Further Widget analysis (which sites are using each widget):
- Calendar (what the calendars are used for: events of the NC or local events)
- Chatbot
- Search
- Translate widgets (how many of the websites had translation software enabled; 7% of all sites use it)

ExperimentsInHonesty commented 1 year ago

@akhaleghi it looks like Willa's update of the script is located in a 311 directory under DS, but it has nothing to do with 311. Let's sort that out.

akhaleghi commented 1 year ago

@willa-mannering are there any updates on this issue?

akhaleghi commented 4 months ago

Next steps:

Review code for scraper here to determine if the scraper still functions.
Add additional functionality to analyze widgets mentioned in Bonnie's comment above, dated 10/17/2022

Rahul-Rut commented 2 months ago

Tasks done: Tweaked the scraper to make it functional again

Added a shell script to handle docker functions (might not commit)
Updated the code to install chrome and chromedriver
Updated the code to properly scrape the website

Rahul-Rut commented 1 month ago

Tasks done:

Merged the two scraper scripts
Created a Jupyter notebook for preliminary analysis of tech; awaiting further inputs

akhaleghi commented 2 weeks ago

@Rahul-Rut Is this issue still being worked on? Is there anything we can do to provide input if you need it?

Rahul-Rut commented 2 weeks ago

@akhaleghi yes, I'll be providing updates soon; just need to wrap it up with a presentation, and I'll reach out in case I require any assistance, thanks!

Rahul-Rut commented 1 week ago

Tasks done:

Updated code
Added search functionality to search tech through keywords in description and tech words
Used cron to schedule

Input required:

Upload files on GitHub
A way to publish the results
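
For anyone picking this up before the files are uploaded, the keyword search described above could be sketched roughly as below. This is an illustration, not the committed code: `search_tech` is a made-up name, and the row fields `tech` and `description` are assumed to match the columns in the OCS sheet.

```python
from typing import Iterable, List


def search_tech(rows: Iterable[dict], keywords: Iterable[str]) -> List[dict]:
    """Return rows whose tech name or description contains any keyword.

    Matching is a case-insensitive substring test, so "translat" matches
    both "Translate" and "translation".
    """
    needles = [k.lower() for k in keywords]
    matches = []
    for row in rows:
        haystack = f"{row.get('tech', '')} {row.get('description', '')}".lower()
        if any(needle in haystack for needle in needles):
            matches.append(row)
    return matches
```

Called with keywords such as "calendar", "chat", "search", and "translat", it would pick out the widget rows listed in the action items at the top of this issue.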

hackforla / data-science

Builtwith API Project #44
