SebastianZimmeck closed this issue 3 months ago
The crawl is in progress right now. Notes for myself:
As per our conversation on the call, I compared the results for some sites in the June crawl pt1 and the April crawl pt1 and they looked similar -- the crawl is looking great so far! To track progress: I'm currently on the third set. In the meantime, I'm reformatting all the json files to be readable and reading through the Google Colab.
Nice, @franciscawijaya!
@franciscawijaya, can you add the following URL to the end of the crawl list and include it in your crawl going forward?
https://www.washingtonpost.com/
(cc'ing @AramZS)
Added to the 8th batch!
The last batch of the crawl is now done and everything looks great so far!
Next step: I will now begin to parse and analyze the crawl data, which should be finished by our Thursday meeting at the latest.
Excellent! Great news!
Update: I have transferred all of the crawl data to the Google Drive and, now that we have all the data, am starting to collate redo_sites.csv using the Google Colab. However, I'm currently facing an error when running one of the lines of code and have been struggling to figure out where it went wrong. I have reached out to labmates for output and will continue debugging.
Solved! I'm now running the redo sites (the Google Colab collated 720 sites without subdomains to be crawled).
An update: I have crawled the redo sites and also tried to run the well-known script. However, I ran into a problem: once the code had fully run, only well-known-data.csv was updated and not well-known-errors.csv. I think I have found what the problem was and am now re-running the script. Hopefully, I will have both well-known-data and well-known-errors by tomorrow morning and can then start parsing and analyzing to get all the figures too.
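For reference, here is a minimal sketch of how a well-known check could write both output files in a single pass, so that well-known-errors.csv is always created (with at least a header row) even when no site fails. This is not the project's actual script: the function name `check_well_known`, the injected `fetch` callable, and the column names are all assumptions made for illustration. The only detail taken from the GPC specification is the `/.well-known/gpc.json` path and its `gpc` field.

```python
import csv
import json

# Path defined by the GPC specification for the well-known resource.
WELL_KNOWN_PATH = "/.well-known/gpc.json"


def check_well_known(sites, fetch, data_path, errors_path):
    """Query each site's well-known GPC resource and log results.

    `fetch` is a hypothetical callable taking a URL and returning
    (status_code, body_text); it may raise on network failure.
    Returns (rows written to data file, rows written to errors file).
    """
    n_data = 0
    n_errors = 0
    # Open both files up front so each one exists with a header,
    # regardless of whether any row is ever written to it.
    with open(data_path, "w", newline="") as data_f, \
            open(errors_path, "w", newline="") as err_f:
        data_w = csv.writer(data_f)
        err_w = csv.writer(err_f)
        data_w.writerow(["site", "status", "gpc"])  # assumed columns
        err_w.writerow(["site", "error"])           # assumed columns
        for site in sites:
            url = site.rstrip("/") + WELL_KNOWN_PATH
            try:
                status, body = fetch(url)
                gpc = None
                if status == 200:
                    try:
                        parsed = json.loads(body)
                        if isinstance(parsed, dict):
                            gpc = parsed.get("gpc")
                    except json.JSONDecodeError:
                        gpc = None  # response was not valid JSON
                data_w.writerow([site, status, gpc])
                n_data += 1
            except Exception as exc:
                # Any failure to fetch goes to the errors file instead.
                err_w.writerow([site, f"{type(exc).__name__}: {exc}"])
                n_errors += 1
    return n_data, n_errors
```

Passing `fetch` in as a parameter keeps the sketch independent of any particular HTTP library; in practice it could wrap whatever client the script already uses.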
As discussed, @franciscawijaya if you can do the following:
Then, feel free to close this issue.
As mentioned in the meeting, while I am done with the crawl, I am still figuring out the parsing/analyzing of the data and creating the figures, which will be my task for this week. I will be closing this issue now as I'm done with the crawl, and I will make a new issue in gpc-web-crawler-paper to post the figures and data once I finish working out the analysis and figures.
@franciscawijaya will perform the crawl (with possible help from @katehausladen).