viragumathe5 / Web-crawler-for-wayback-machine

This repository is for the submission of a patch to the Internet Archive for GSoC 2020
Apache License 2.0

Please review the Proposal #4

Open viragumathe5 opened 4 years ago

viragumathe5 commented 4 years ago

@cclauss, would you please review my GSoC proposal? I look forward to your feedback.

Thank You

viragumathe5 commented 4 years ago

@cclauss Please see proposal.md

Thank You

cclauss commented 4 years ago

Really cool! It sounds like a fun project.

Do the current wayback machine web scrapers deal with news and social media? If yes, what improvements need to be made and why?

What are some news sites that you propose to crawl? What are some social media sites that you propose to crawl?

How will you ensure that your approach is efficient and will not overburden the target websites?

What are some of the sources of web scraping best practice that you intend to use?

cclauss commented 4 years ago

Also, why Python? The current codebase is Java, so why make the shift? Will Python make the functionality easier to implement? Are there more mature web scraping capabilities in the Python community than in the Java community?

Will it be easier for new contributors to add new sites and new functionality when this work is done?

Will it be possible to add new sites via configuration or will new code have to be written for each new site?

viragumathe5 commented 4 years ago

@cclauss The reasons I am using Python here for scraping and crawling are as follows:

  1. Although the codebase is written in Java, Python makes the functionality easier for the organization to implement and easier for the community to operate.
  2. It will be easier for us to make it comparable with other languages.
  3. Python has a generous set of scraping libraries, such as BeautifulSoup4 and Scrapy.
  4. We will get full web support from libraries such as requests and envelope.
  5. It is easier for newcomers to open source and the Internet Archive to get involved and contribute.

My personal view on it:

  1. I have been working on web scraping with Python for the past 10 months and am really comfortable with it.
  2. I am also working on information retrieval and doing research on it.

All these reasons show why I am proposing the solution in Python. Still, if you want to use Java for the solution, I am flexible and will definitely complete the project.

viragumathe5 commented 4 years ago

@cclauss I found one Wayback Machine scraper, but it does not support scraping attributes like messages and social media content. It does provide a CLI, though. I really think it would be great if we could provide a CLI for our scraper as well. What do you think?

If you like this, I will add it to my proposal so that I can prepare for it during the Community Bonding period.

cclauss commented 4 years ago

I am OK with using Python. I just wanted to make sure that your proposal provided a strong justification for why you chose to use Python.

viragumathe5 commented 4 years ago

OK, sure, I will definitely add it.

Regarding your question about Wayback Machine scrapers, I found a repository that uses a Wayback scraper, but it doesn't work on messages and social media content. Here is its link: https://github.com/sangaline/wayback-machine-scraper

I also wanted to ask: I was really attracted to this project from the start, but the project idea contains no mentor name, email, or repository, and the page did not specify the community either. Does this project have the lowest priority for GSoC 2020, since the idea appeared only in the Google Doc? May I know something about this, please?

cclauss commented 4 years ago

@mekarpeles or @kngenie would be better able to comment on the strategic importance of improved crawlers in the wayback machine codebase.

viragumathe5 commented 4 years ago

If you think the proposal needs any other changes, please tell me.

Thank You

viragumathe5 commented 4 years ago

@mekarpeles and @kngenie, would you please review my proposal and guide me? I look forward to your every word.

Thank you

viragumathe5 commented 4 years ago

@cclauss I have updated the proposal with all the changes you suggested. Thank you so much for the suggestions. Please let me know if you have any others. Thank you.

viragumathe5 commented 4 years ago

@cclauss Does that result look satisfactory, or should we change something?

mekarpeles commented 4 years ago

I have to defer completely to @kngenie and Mark Graham -- I can only speak to OpenLibrary.org. @viragumathe5 I'd make sure to email Mark (again, if necessary) for next steps.

viragumathe5 commented 4 years ago

I am totally unable to reach @mark, and I really don't know why. It would be very beneficial for me if you could do so.

Thank You so much