Open viragumathe5 opened 4 years ago
@cclauss Please see proposal.md
Thank You
Really cool! It sounds like a fun project.
Do the current Wayback Machine web scrapers deal with news and social media? If yes, what improvements need to be made and why?
What are some news sites that you propose to crawl? What are some social media sites that you propose to crawl?
How will you ensure that your approach is efficient and will not overburden the target websites?
What are some of the sources of web scraping best practice that you intend to use?
Also, why Python? The current codebase is Java, so why make the shift? Will Python make the functionality easier to implement? Are there more mature web scraping capabilities in the Python community than in the Java community?
Will it be easier for new contributors to add new sites and new functionality when this work is done?
Will it be possible to add new sites via configuration or will new code have to be written for each new site?
@cclauss The reason I am using Python here for scraping and crawling is as
My personal view over it
All of these reasons show why I am proposing the solution in Python. Still, if you want to use Java for the solution, I am flexible enough and will definitely complete the project.
@cclauss I found one Wayback Machine scraper, but it does not support scraping attributes like messages and social media content. It does provide a CLI, though. I really think it would be great if we could provide a CLI for our scraper as well. What do you think about this?
If you like this, I will add it to my proposal so that I can prepare for it in the community bonding period.
I am OK with using Python. I just wanted to make sure that your proposal provided a strong justification for why you chose to use Python.
OK, sure, I will definitely add it.
Regarding your question about Wayback Machine scrapers, I found a repository that scrapes the Wayback Machine, but it doesn't work on messages and social media content. Here is the link: https://github.com/sangaline/wayback-machine-scraper
I also wanted to ask about something. I was really attracted to this project from the start, but the project idea lists no mentor name, email, or repository, and the page does not specify the community either. The idea appears only in the Google Doc. Does this project have the lowest priority for GSoC 2020? May I know something about this, please?
@mekarpeles or @kngenie would be better able to comment on the strategic importance of improved crawlers in the Wayback Machine codebase.
If you think the proposal needs any other changes, please tell me.
Thank You
@mekarpeles and @kngenie, would you please review my proposal and guide me? I look forward to your feedback.
thank you
@cclauss I have updated the proposal with all the changes you suggested. Thank you so much for the suggestions. Please let me know if there is anything else. Thank You
@cclauss does that result look satisfactory, or should we change something?
I have to defer completely to @kngenie and Mark Graham -- I can only speak to OpenLibrary.org. @viragumathe5 I'd make sure to email Mark (again, if necessary) for next steps.
I am totally unable to reach @mark, and I really don't know why. It would be very beneficial for me if you could do so.
Thank You so much
@cclauss will you please review my GSoC proposal? I look forward to your feedback.
Thank You