pdutt111 opened this issue 11 years ago
Hi Pariskshit, thanks for your proposal. I think your project assumes that PageOneX has a coded database with the location of the news. For the moment, PageOneX is a tool to help code front pages. In some cases it can be used for that purpose: coding the location (country) of certain news. It could be a good project built on top of the tool, but I doubt it can be a core functionality.
I don't actually need the exact location of the news: it can be traced to the city using the article, and a marker put there.

Also, the main aim of the project is not to trace the exact location of the news piece. The main aim is to find out which countries the news reaches. This can be derived from the number of papers the news piece appears in and the countries where those papers are circulated.

Another aspect is to find the tweets/posts about that news piece and track the sentiment and popularity around it. This feature will help analyse the reaction to news items, and a person can actually find out how popular a news item is.

I think this addition to the core functionality of PageOneX would be a great one.
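A naive sketch of the tweet-sentiment idea above, assuming the tweets have already been fetched (e.g. via the Twitter search API) and using a tiny illustrative word list in place of real sentiment analysis:

```ruby
# Sketch: tally positive/negative/neutral tweets about a story.
# The word lists are illustrative placeholders; a real implementation
# would use a proper sentiment lexicon or model.
POSITIVE = %w[good great support win hope]
NEGATIVE = %w[bad terrible against loss fear]

def sentiment_summary(tweets)
  summary = { positive: 0, negative: 0, neutral: 0, total: tweets.size }
  tweets.each do |tweet|
    words = tweet.downcase.scan(/[a-z]+/)
    score = (words & POSITIVE).size - (words & NEGATIVE).size
    key = score > 0 ? :positive : (score < 0 ? :negative : :neutral)
    summary[key] += 1
  end
  summary
end
```

The `:total` count doubles as the raw popularity measure (how many tweets mention the story at all), independent of sentiment.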
Before automating the extraction of the location of a particular news item (city, country, continent), the tool has to be able to get the content of the news. We are envisioning a previous step to access the text of the news via different methods: A. scrape directly from newspaper PDFs (build scrapers to get them), B. use OCR, C. match with existing databases (like LexisNexis, although it has some copyright issues), or use newspaper APIs (like the NYTimes). This step will allow researchers to look for words in the front pages and facilitate the process. Would you be interested in working on this? We are open to other solutions as well!

Besides, tracking news on Twitter is also interesting to answer questions like: what is the reach of news that makes it to the front page? Which leads to another possible functionality to add: tracking online front pages of newspapers.

Re: geocoding of news. Another group in Civic Media is working specifically on that. Check this: http://globe.mediameter.org/
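For the simpler goal of tracing a story to a country once its text is available, a minimal keyword-gazetteer sketch could look like the following. The place list here is purely illustrative; a real gazetteer (e.g. built from GeoNames data) would be far larger:

```ruby
# Sketch: map an article's text to candidate countries by matching
# place names against a tiny hand-made gazetteer (illustrative only).
GAZETTEER = {
  "london"    => "United Kingdom",
  "madrid"    => "Spain",
  "new york"  => "United States",
  "new delhi" => "India"
}

# Return the unique countries whose place names appear in the text.
def candidate_countries(text)
  normalized = text.downcase
  GAZETTEER.select { |place, _| normalized.include?(place) }
           .values
           .uniq
end
```

This substring approach is crude (it cannot disambiguate "London, Ontario", for instance), which is one reason delegating to a dedicated geocoding project like the one linked above makes sense.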
Hi,

I'll be happy to work on it. There is this API for PDFs, http://pdfbox.apache.org, which we can use. How exactly do you plan on building the scraper? Also, will the scraper work independently, or will it have to be integrated with some existing code?

Thanks, Pariskshit Dutt
Hi,

Did you have a chance to think about it? When can we start?

Thanks, Pariskshit
Hi Pariskshit, now that we have launched the tool with all the basics covered, we are ready to take the next steps! I will soon send another email through the developers list with the different lines of work that could be developed.
Regarding scraping and pdf:
When cleaning the code, the part that was prepared for scraping directly from the newspapers was removed in the file lib/scraper.rb: https://github.com/numeroteca/pageonex/commit/ed54f4b7422fad32cbe6b7fde4ce71e41473e262#lib
It was built to get the print front pages from El País (PDF) and The New York Times (PNG). The best scenario is to be able to get the PDF directly from the papers. We started building a database to help build the scrapers: http://bit.ly/newspaperfront The user would select from the beginning which paper they are scraping from, so the first approach is that the scraper would be integrated.
Once we are able to get the PDFs, we could start thinking about using http://pdfbox.apache.org, which you mention. An interesting related project is Xed (http://diuf.unifr.ch/main/diva/research/research-projects/xed), "a new tool for extracting hidden structures from electronic documents" (Document Image Analysis for Libraries, First International Workshop, IEEE, 2004).
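The per-newspaper scraper selection described above could be sketched as a registry mapping each paper to a front-page URL template. The paper slugs and URL patterns below are placeholders for illustration, not the real feeds:

```ruby
require "date"

# Hypothetical registry mapping a newspaper slug to a lambda that
# builds its front-page URL for a given date. The URL patterns are
# placeholders; each real scraper would encode that paper's actual path.
SCRAPERS = {
  "elpais"  => ->(date) { "http://example.com/elpais/#{date.strftime('%Y/%m/%d')}/portada.pdf" },
  "nytimes" => ->(date) { "http://example.com/nyt/frontpage-#{date.strftime('%Y-%m-%d')}.png" }
}

# Build the URL for a given paper and date, or nil if unsupported.
def front_page_url(paper, date)
  builder = SCRAPERS[paper]
  builder && builder.call(date)
end
```

Keeping each paper's quirks inside its own lambda (or class, as the scrapers grow) matches the "user selects the paper up front" flow: the selection just picks the registry entry.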
Hi,

OK, cool. Instead of integrating the scraper, I think keeping it separate and storing the info in a database would be better, as it would take a lot less time when the user wants to code the paper. We can store the title, article, and file location in the database. We can put the scraper in a cron job to run daily so that the database is updated daily, and if a person tries to code a paper before it has been scraped, we can run the scraper at that time. Also, how are the images being scraped? Is an OCR being used?
As I mentioned before in this thread, we are not yet reading the newspaper images. We are scraping them and letting users code them. There are several possible ways to help the user find words in the front pages:

- A. scrape directly from newspaper PDFs (build scrapers to get them): http://bit.ly/newspaperfront
- B. use OCR, though some images have too small a font and are not readable
- C. match with existing databases (like LexisNexis, although it has some copyright issues), or use newspaper APIs (like the NYTimes)
- D. once we start coding online front pages, the process of getting the text will be easier

Do you want to take any of these paths to test? The separate database that you mention would be great, but I think it is an entirely new field to open for this project, bigger than the project itself is now.

I'll start on the OCR. I have had a look at the Google Drive SDK; it has an OCR option, which should be good: https://developers.google.com/drive/v2/reference/files/insert There is an OCR operation that we can ask Google to perform while uploading a file to Drive, and then retrieve the resulting document.
Can you be more specific and clearer? I don't get what you are trying to achieve, or how. It looks like part of the content of your text was redacted.
I'll start working on the OCR. There is an OCR operation in the Google Drive API that we can use.
Hi,

I would like to propose a new functionality for PageOneX.

We can show on a map the areas where a news item has made it to newspaper front pages: some news might appear only in the UK, some might make it to nearby European countries too, and some might be all over the globe. So we can draw an overlay on the world map showing the spread of a news item. We can also measure the popularity of a news piece using Facebook or Twitter, so that newspaper people can see which news catches eyeballs. This would be a nice visualisation of the data on PageOneX.
We can use the Facebook Graph Search API (https://developers.facebook.com/docs/reference/api/search/) or the Twitter Search API (https://dev.twitter.com/docs/api/1/get/search) to find the popularity of news, and the Google Maps API (https://developers.google.com/maps/) for the maps.
Thanks, Pariskshit Dutt, India
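The front-page-spread overlay proposed above needs per-country counts to drive the map intensity. A minimal sketch, assuming each coded record carries the paper's circulation country (the record shape here is an assumption, not PageOneX's actual schema):

```ruby
# Sketch: given coded front-page appearances of one story (assumed
# shape: each record has a :paper and its circulation :country),
# count how many front pages per country carried the story. These
# counts could then set the choropleth shading on a world map.
def spread_by_country(appearances)
  appearances.each_with_object(Hash.new(0)) do |record, counts|
    counts[record[:country]] += 1
  end
end

coded = [
  { paper: "The Guardian", country: "United Kingdom" },
  { paper: "The Times",    country: "United Kingdom" },
  { paper: "El País",      country: "Spain" }
]
spread = spread_by_country(coded)  # => {"United Kingdom"=>2, "Spain"=>1}
```

Each country's count (optionally normalised by the number of tracked papers in that country) would become the overlay intensity in the Google Maps layer.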