montera34 / pageonex

PageOneX. Analyzing front pages
http://pageonex.com
GNU Affero General Public License v3.0

Wrong images downloaded: newspapers with the same media.name. Related to #168. #170

Closed numeroteca closed 10 years ago

numeroteca commented 11 years ago

Related to #168.

Found a bug while downloading new images in a thread http://pageonex.com/numeroteca/corrupcion-espana-julio-2013/: the wrong "elmundo" images were downloaded into the thread, retroactively, for July 3-6 (previously I had the correct El Mundo, the one from Spain).

Here are the two newspapers:

El Salvador,sv,El Mundo,elmundo,http://www.elmundo.com.sv/
Spain,es,El Mundo,elmundo,http://www.elmundo.es/

So I guess the scraping code should not assume that media.name is unique either.

rahulbot commented 10 years ago

This process fixes it going forward:

rake db:migrate
rake scraping:migrate_media_folders_to_include_country_codes

However, I don't think I can fix it for previously downloaded images where more than one media source has the same name, because I don't know which media source the image is from :-(

If we want to fix this thoroughly, this might work:

  1. make a list of all the media sources that have duplicate names
  2. find all the threads that reference those sources
  3. re-download all the images for each of those threads (by running this in the Rails console: Threadx.find_by_thread_name([slug]).scrape_all_images true)
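
The first two steps above could be sketched roughly like this in plain Ruby. MediaRow, ThreadRow, their fields, and the sample ids and slugs are hypothetical stand-ins for illustration; the real PageOneX models and columns may look different, and in the app this would presumably be done with database queries instead.

```ruby
# Hypothetical stand-ins; the real PageOneX models and columns may differ.
MediaRow  = Struct.new(:id, :country_code, :name)
ThreadRow = Struct.new(:slug, :media_ids)

# Step 1: media sources that share a name with at least one other source.
def duplicate_named_media(media)
  media.group_by(&:name).values.select { |group| group.size > 1 }.flatten(1)
end

# Step 2: slugs of the threads that reference any of those sources.
def slugs_to_rescrape(threads, duplicates)
  duplicate_ids = duplicates.map(&:id)
  threads.select { |t| (t.media_ids & duplicate_ids).any? }.map(&:slug)
end

# Made-up sample data: two sources named "elmundo", one named "elpais".
media = [
  MediaRow.new(1, 'es', 'elmundo'),
  MediaRow.new(2, 'sv', 'elmundo'),
  MediaRow.new(3, 'es', 'elpais')
]
threads = [
  ThreadRow.new('thread-a', [1, 3]),
  ThreadRow.new('thread-b', [3])
]

duplicates = duplicate_named_media(media)
slugs = slugs_to_rescrape(threads, duplicates)
puts slugs.inspect
# Step 3 would then run the console call from the list above once per slug.
```

Only threads that touch a duplicated name end up in the list, which keeps the re-download as small as possible.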

rahulbot commented 10 years ago

Updated on dev and production - the new folder names that include country-code seem to be working.
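
For reference, here is a minimal sketch of what a folder name that includes the country code looks like. The path layout is inferred from the file paths that appear in the error log further down this thread; the helper name local_image_path and the IMAGE_ROOT constant are made up for illustration and are not the app's actual code.

```ruby
require 'date'

# Root folder as it appears in the log paths in this thread.
IMAGE_ROOT = 'app/assets/images/kiosko'

# The folder is "<country_code>-<name>", so two papers that are both
# named "elmundo" no longer write into the same folder.
def local_image_path(country_code, name, date)
  day = date.strftime('%Y-%m-%d')
  File.join(IMAGE_ROOT, "#{country_code}-#{name}", "#{name}-#{day}.jpg")
end

es = local_image_path('es', 'elmundo', Date.new(2013, 7, 8))
sv = local_image_path('sv', 'elmundo', Date.new(2013, 7, 8))
puts es
puts sv
```

With the name alone as the folder, both calls would have produced the same path, which is how one paper's images could overwrite the other's.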

numeroteca commented 10 years ago

I've only seen this problem with the elmundo newspapers mentioned above (another user reported the same today). I need the Spanish Elmundo.

I am testing with this thread http://pageonex.com/numeroteca/quien-escribe-las-noticias/

I tried to run the code to fix it in the production console (RAILS_ENV="production" rails console): Threadx.find_by_thread_name('quien-escribe-las-noticias').scrape_all_images true

but I get many errors like:

...
Image Load (57.0ms) SELECT images.* FROM images WHERE images.media_id = 146 AND images.publication_date = '2013-07-08' ORDER BY publication_date ASC, media_id ASC LIMIT 1
Media Load (0.3ms) SELECT media.* FROM media WHERE media.working = 1 AND media.id = 146 LIMIT 1
Image Download Failed:57447: couldn't find image at http://img.kiosko.net/2013/07/08/es/elmundo.750.jpg (Permission denied - app/assets/images/kiosko/es-elmundo/elmundo-2013-07-08.jpg)
...
(0.1ms) BEGIN
(0.1ms) COMMIT
Image Load (49.9ms) SELECT images.* FROM images WHERE images.media_id = 490 AND images.publication_date = '2013-07-08' ORDER BY publication_date ASC, media_id ASC LIMIT 1
Media Load (0.2ms) SELECT media.* FROM media WHERE media.working = 1 AND media.id = 490 LIMIT 1
Image Download Failed:57790: couldn't find image at http://img.kiosko.net/2013/07/08/es/elpais.750.jpg (Permission denied - app/assets/images/kiosko/es-elpais/elpais-2013-07-08.jpg)
(0.1ms) BEGIN
(0.1ms) COMMIT
Image Load (47.6ms) SELECT images.* FROM images WHERE images.media_id = 490 AND images.publication_date = '2013-07-09' ORDER BY publication_date ASC, media_id ASC LIMIT 1
Media Load (0.2ms) SELECT media.* FROM media WHERE media.working = 1 AND media.id = 490 LIMIT 1
Image Download Failed:58024: couldn't find image at http://img.kiosko.net/2013/07/09/es/elpais.750.jpg (Permission denied - app/assets/images/kiosko/es-elpais/elpais-2013-07-09.jpg)
(0.1ms) BEGIN

Is it just a problem of permissions?
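
One quick way to check the permissions theory from the same console is to ask Ruby whether the target folder exists and is writable by the current process. This is only a sketch: folder_status is a made-up helper, and the result is only meaningful when the console runs as the same user as the Rails app.

```ruby
# Hypothetical helper: reports whether a folder is missing, writable,
# or present but not writable by the user running this process.
def folder_status(dir)
  return :missing unless File.directory?(dir)
  File.writable?(dir) ? :writable : :not_writable
end

# Folder taken from the "Permission denied" lines in the log above.
puts folder_status('app/assets/images/kiosko/es-elmundo')
```

A :not_writable result would match the "Permission denied" messages; a :missing result would point at the folder never having been created.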

Besides, some thumbnails in the composite are now missing, and so are the bars above those days. [Screenshot: pageonex_missingimages]

numeroteca commented 10 years ago

I created a thread with the two Elmundo newspapers http://pageonex.com/numeroteca/el-mundo-test/ and I saw a lot of different errors:

I see that there is no folder created in app/assets/images for the El Salvador newspaper: sv-elmundo, which might be part of the problem!

[I added you as collaborators in the thread.]

rahulbot commented 10 years ago

I fixed the image storage issues. I think the image map and coding-carousel problems are also tied to the assumption of unique media names.

rahulbot commented 10 years ago

I tried those two threads and they are working now. Let me know if you run into any other weirdness... this (incorrect) assumption of unique media names clearly shows up in a lot of places we need to fix.

numeroteca commented 10 years ago

1st weirdness: in the coding view, when I draw an area on the Spanish El Mundo, the same area is drawn on the El Salvador one for the same day (and vice versa). Then, in the display view, the areas are only displayed on the Spanish newspaper. Test case: http://pageonex.com/numeroteca/el-mundo-test/

elplatt commented 10 years ago

Closed by f3fc0915bb85b026ebdf46406031de0fafe062f0

numeroteca commented 10 years ago

I've found a new conflict, between the Argentinian and the Paraguayan "La Nacion". I saw it in this series of threads by a user: the display view was working before, and now La Nacion is not working any more: http://pageonex.com/marielb/ley-de-voto-a-los-16-2/ or http://pageonex.com/marielb/ley-de-voto-a-los-16-1/

I created another thread to test with: http://pageonex.com/numeroteca/test-repeated/ The Argentinian nacion images are mixed with the py nacion ones.

rahulbot commented 10 years ago

I think the "La Nacion" issue is a holdover from existing images that were fetched when the code wasn't smart about papers with the same name. To fix this, I rescraped the images for the thread you mentioned, test-repeated:

Threadx.find_by_thread_name('test-repeated').scrape_all_images true

The other two (ley-de-voto-a-los-16-1, ley-de-voto-a-los-16-2) were related to the issues around thumbnails that exist but have size 0. I added some code (059392598b66ffb551f49aa37f323ff80f5bb0b4) to handle this better, and now those two work (after I rescraped all the images).
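
As an illustration of the "exists but has size 0" case, a check along these lines tells an empty file apart from a usable one. It is a sketch only, not the code from the commit above; usable_image? and empty_images are made-up names, and the .jpg extension is assumed from the paths seen earlier in this thread.

```ruby
# True only when the file exists and contains at least one byte.
# File.size? returns nil for both a missing file and an empty one.
def usable_image?(path)
  !File.size?(path).nil?
end

# Lists the .jpg files in a folder that exist but are empty,
# i.e. candidates for re-downloading.
def empty_images(dir)
  Dir.glob(File.join(dir, '*.jpg')).select { |path| File.zero?(path) }.sort
end
```

Treating an empty file the same as a missing one means a later scrape will fetch it again instead of assuming it is already there.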