kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

media urls [issue] #251

Closed fashan7 closed 3 years ago

fashan7 commented 3 years ago

I noticed some issues while scraping media URLs.

For https://www.facebook.com/EcomnewsAfrique/posts/2564824823826632 the image is not returned in the images key:

"images":[]

For https://www.facebook.com/go.up.afrika/posts/2358414964302425 the video is not returned; the video key is None:

`[
   {
      "post_id":"2358414964302425",
      "text":"🔴🔴🔴Afrik-inform Awards\nAristide Bounah, président du comité d'organisation de l'événement que tout le Cameroun attend était l'invité du journal ce lundi sur Canal2 international.\n\nContinuez de voter pour votre candidat préféré aux Afrik-Inform Awards via ce lien.\nhttps://afrik-inform-awards.faroty.com/index.php\n\nUn citoyen, un vote.",
      "post_text":"🔴🔴🔴Afrik-inform Awards\nAristide Bounah, président du comité d'organisation de l'événement que tout le Cameroun attend était l'invité du journal ce lundi sur Canal2 international.\n\nContinuez de voter pour votre candidat préféré aux Afrik-Inform Awards via ce lien.\nhttps://afrik-inform-awards.faroty.com/index.php\n\nUn citoyen, un vote.",
      "shared_text":"",
      "time":datetime.datetime(2021, 5, 11, 12, 58, 25),
      "image":"None",
      "image_lowquality":"https://static.xx.fbcdn.net/rsrc.php/v3/yO/r/tlBBJvkvBea.png",
      "images":[

      ],
      "images_description":[

      ],
      "images_lowquality":[
         "https://static.xx.fbcdn.net/rsrc.php/v3/yO/r/tlBBJvkvBea.png"
      ],
      "images_lowquality_description":[
         "None"
      ],
      "video":"None",
      "video_duration_seconds":"None",
      "video_height":"None",
      "video_id":"None",
      "video_quality":"None",
      "video_size_MB":"None",
      "video_thumbnail":"None",
      "video_watches":"None",
      "video_width":"None",
      "likes":2,
      "comments":0,
      "shares":4,
      "post_url":"https://facebook.com/story.php?story_fbid=2358414964302425&id=995659197244682",
      "link":"https://afrik-inform-awards.faroty.com/index.php",
      "user_id":"995659197244682",
      "username":"GO Afrika",
      "user_url":"https://facebook.com/go.up.afrika/?__tn__=C-R",
      "is_live":false,
      "factcheck":"None",
      "shared_post_id":"None",
      "shared_time":"None",
      "shared_user_id":"None",
      "shared_username":"None",
      "shared_post_url":"None",
      "available":false,
      "comments_full":[

      ],
      "reactors":"None",
      "w3_fb_url":"https://www.facebook.com/go.up.afrika/posts/2358414964302425",
      "reactions":{
         "like":2
      },
      "fetched_time":datetime.datetime(2021, 5, 12, 0, 8, 43, 304419)
   }
]`

@neon-ninja please check

neon-ninja commented 3 years ago

2564824823826632 isn't an image post, it's a link. The image shown is actually a preview image from an external site, and clicking that link takes you to the external site.

2358414964302425 requires you to be logged in to see the video. Try opening https://m.facebook.com/2358414964302425 in an incognito window.

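import pprint
from facebook_scraper import get_posts

# Passing cookies makes the request as a logged-in user, which this post requires.
posts = list(get_posts(
    post_urls=[2358414964302425],
    cookies="cookies.txt"
))
pprint.pprint(posts)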
[{'available': True,
  'comments': 0,
  'comments_full': None,
  'factcheck': None,
  'image': None,
  'image_lowquality': None,
  'images': [],
  'images_description': [],
  'images_lowquality': [],
  'images_lowquality_description': [None],
  'is_live': False,
  'likes': 2,
  'link': 'https://afrik-inform-awards.faroty.com/index.php?fbclid=IwAR0cmVK0aaTCKTWbaDnYVF6ea-8j4TJAUZD2OL77SD8KuE4f1mn3czoHxGU',
  'post_id': '2358414964302425',
  'post_text': '🔴🔴🔴Afrik-inform Awards\n'
               "Aristide Bounah, président du comité d'organisation de "
               "l'événement que tout le Cameroun attend était l'invité du "
               'journal ce lundi sur Canal2 international.\n'
               '\n'
               'Continuez de voter pour votre candidat préféré aux '
               'Afrik-Inform Awards via ce lien.\n'
               'https://afrik-inform-awards.faroty.com/index.php\n'
               '\n'
               'Un citoyen, un vote.\n'
               '\n'
               '🔴🔴🔴Afrik-inform Awards\n'
               'Aristide Bounah, chairman of the organizing committee of the '
               'event that all Cameroon is waiting for was the guest of the '
               'newspaper this Monday on Canal2 international.\n'
               '\n'
               'Keep voting for your favorite Afrik-Inform Awards candidate '
               'via this link.\n'
               'https://afrik-inform-awards.faroty.com/index.php\n'
               '\n'
               'One citizen, one vote.',
  'post_url': 'https://facebook.com/story.php?story_fbid=2358414964302425&id=995659197244682',
  'reactors': None,
  'shared_post_id': None,
  'shared_post_url': None,
  'shared_text': '',
  'shared_time': None,
  'shared_user_id': None,
  'shared_username': None,
  'shares': 4,
  'text': '🔴🔴🔴Afrik-inform Awards\n'
          "Aristide Bounah, président du comité d'organisation de l'événement "
          "que tout le Cameroun attend était l'invité du journal ce lundi sur "
          'Canal2 international.\n'
          '\n'
          'Continuez de voter pour votre candidat préféré aux Afrik-Inform '
          'Awards via ce lien.\n'
          'https://afrik-inform-awards.faroty.com/index.php\n'
          '\n'
          'Un citoyen, un vote.\n'
          '\n'
          '🔴🔴🔴Afrik-inform Awards\n'
          'Aristide Bounah, chairman of the organizing committee of the event '
          'that all Cameroon is waiting for was the guest of the newspaper '
          'this Monday on Canal2 international.\n'
          '\n'
          'Keep voting for your favorite Afrik-Inform Awards candidate via '
          'this link.\n'
          'https://afrik-inform-awards.faroty.com/index.php\n'
          '\n'
          'One citizen, one vote.',
  'time': datetime.datetime(2021, 5, 11, 19, 57, 3, 455941),
  'user_id': '995659197244682',
  'user_url': 'https://facebook.com/go.up.afrika/?refid=52&__tn__=C-R',
  'username': 'GO Afrika',
  'video': 'https://video.fakl1-2.fna.fbcdn.net/v/t42.1790-2/185707850_2298739410259997_4927790081844978894_n.mp4?_nc_cat=100&ccb=1-3&_nc_sid=985c63&efg=eyJ2ZW5jb2RlX3RhZyI6InN2ZV9zZCJ9&_nc_ohc=lWu89KwfYykAX_Y9x9f&tn=9rjm9JeVMBQjJ4A_&_nc_rml=0&_nc_ht=video.fakl1-2.fna&oh=f7839fb4895211bacb6cee05d6d42303&oe=609B3B96',
  'video_duration_seconds': 207,
  'video_height': 360,
  'video_id': '897588874417943',
  'video_quality': '360p',
  'video_size_MB': 11.979237,
  'video_thumbnail': 'https://scontent.fakl1-3.fna.fbcdn.net/v/t15.5256-10/cp0/e15/q65/s320x320/182256987_897589137751250_2548767959302356298_n.jpg?_nc_cat=103&ccb=1-3&_nc_sid=ccf8b3&efg=eyJpIjoidCJ9&_nc_ohc=EFRKZ3us9bcAX_-G4JW&_nc_ht=scontent.fakl1-3.fna&tp=9&oh=72bc3076f6967f2043fb2b5c79d4d193&oe=60C08F73',
  'video_watches': 51,
  'video_width': 640,
  'w3_fb_url': None}]
fashan7 commented 3 years ago

@neon-ninja For 2564824823826632, couldn't we extract that image and add it to the images list?

neon-ninja commented 3 years ago

Yes, it should be extracted as image_lowquality

[{'available': True,
  'comments': 0,
  'comments_full': None,
  'factcheck': None,
  'image': None,
  'image_lowquality': 'https://external.fakl1-3.fna.fbcdn.net/safe_image.php?d=AQFjWCHrahv3oAwS&w=476&h=249&url=https%3A%2F%2Fecomnewsafrique.com%2Fapp%2Fuploads%2Fsites%2F4%2F2021%2F05%2Fconakry-guinee.png&cfs=1&jq=75&ext=jpg&ccb=3-5&_nc_hash=AQEt4q2LMZRCBMEK',
  'images': [],
  'images_description': [],
  'images_lowquality': ['https://external.fakl1-3.fna.fbcdn.net/safe_image.php?d=AQFjWCHrahv3oAwS&w=476&h=249&url=https%3A%2F%2Fecomnewsafrique.com%2Fapp%2Fuploads%2Fsites%2F4%2F2021%2F05%2Fconakry-guinee.png&cfs=1&jq=75&ext=jpg&ccb=3-5&_nc_hash=AQEt4q2LMZRCBMEK'],
  'images_lowquality_description': [None],
  'is_live': False,
  'likes': 0,
  'link': 'https://bit.ly/3o4lBrM',
  'post_id': '2564824823826632',
  'post_text': 'Les mesures mises en place dans le cadre de la propagation du '
               '#coronavirus ont eu un impact négatif sur nombre d’économies '
               'africaines. Malgré ces contraintes, la #Guinée fait partie des '
               'rares pays africains à avoir maintenu une croissance positive '
               'en 2020 https://bit.ly/3o4lBrM',
  'post_url': 'https://facebook.com/story.php?story_fbid=2564824823826632&id=1932987740343680',
  'reactors': None,
  'shared_post_id': None,
  'shared_post_url': None,
  'shared_text': 'ECOMNEWSAFRIQUE.COM\n'
                 'La Guinée a connu une croissance économique de 7 % en 2020 '
                 'et cela en dépit de la crise sanitaire - Ecomnews Afrique',
  'shared_time': None,
  'shared_user_id': None,
  'shared_username': None,
  'shares': 0,
  'text': 'Les mesures mises en place dans le cadre de la propagation du '
          '#coronavirus ont eu un impact négatif sur nombre d’économies '
          'africaines. Malgré ces contraintes, la #Guinée fait partie des '
          'rares pays africains à avoir maintenu une croissance positive en '
          '2020 https://bit.ly/3o4lBrM\n'
          '\n'
          'ECOMNEWSAFRIQUE.COM\n'
          'La Guinée a connu une croissance économique de 7 % en 2020 et cela '
          'en dépit de la crise sanitaire - Ecomnews Afrique',
  'time': datetime.datetime(2021, 5, 11, 22, 30, 33),
  'user_id': '1932987740343680',
  'user_url': 'https://facebook.com/EcomnewsAfrique/?refid=52&__tn__=C-R',
  'username': 'Ecomnews Afrique',
  'video': None,
  'video_duration_seconds': None,
  'video_height': None,
  'video_id': None,
  'video_quality': None,
  'video_size_MB': None,
  'video_thumbnail': None,
  'video_watches': None,
  'video_width': None,
  'w3_fb_url': None}]
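
For link posts like this one, `image` stays None and the preview only appears under `image_lowquality`. A caller who just wants "some image" could fall back from one key to the other. The helper below is a minimal sketch: `best_image_url` is a hypothetical name, not part of facebook-scraper, and it relies only on the two keys visible in the output above.

```python
from typing import Any, Dict, Optional


def best_image_url(post: Dict[str, Any]) -> Optional[str]:
    """Prefer the full-size image; fall back to the low-quality preview."""
    # Hypothetical helper, not part of facebook-scraper.
    for key in ("image", "image_lowquality"):
        value = post.get(key)
        # The first dump in this thread stringified None as "None",
        # so both forms are treated as missing.
        if value and value != "None":
            return value
    return None
```

It takes a single post dict as yielded by `get_posts` and returns None when neither key holds a URL.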

Or are you saying the scraper should somehow extract https://ecomnewsafrique.com/app/uploads/sites/4/2021/05/conakry-guinee.png ?

import urllib.parse
urllib.parse.parse_qs("https://external.fakl1-3.fna.fbcdn.net/safe_image.php?d=AQFjWCHrahv3oAwS&w=476&h=249&url=https%3A%2F%2Fecomnewsafrique.com%2Fapp%2Fuploads%2Fsites%2F4%2F2021%2F05%2Fconakry-guinee.png&cfs=1&jq=75&ext=jpg&ccb=3-5&_nc_hash=AQEt4q2LMZRCBMEK")["url"][0]

seems to do the trick
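
The one-liner works because `parse_qs` splits on `&`, so the `url` key is still found even though the whole URL is passed in (the first key ends up containing the scheme, host and path). A slightly more defensive sketch separates the query string first with `urlparse`. The function name `extract_original_image_url` is made up for illustration, and this is not necessarily how the scraper itself does it.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_original_image_url(safe_image_url: str) -> Optional[str]:
    """Return the decoded `url` query parameter of a safe_image.php link, or None."""
    # Isolate the query string so only real query keys are parsed.
    query = urlparse(safe_image_url).query
    # parse_qs percent-decodes values and maps each key to a list of values.
    values = parse_qs(query).get("url")
    return values[0] if values else None


proxy_url = (
    "https://external.fakl1-3.fna.fbcdn.net/safe_image.php"
    "?d=AQFjWCHrahv3oAwS&w=476&h=249"
    "&url=https%3A%2F%2Fecomnewsafrique.com%2Fapp%2Fuploads%2Fsites%2F4%2F2021%2F05%2Fconakry-guinee.png"
    "&cfs=1&jq=75&ext=jpg&ccb=3-5&_nc_hash=AQEt4q2LMZRCBMEK"
)
print(extract_original_image_url(proxy_url))
```

Returning None instead of raising `KeyError` keeps the helper safe to call on image URLs that are not `safe_image.php` proxies.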

neon-ninja commented 3 years ago

Try https://github.com/kevinzg/facebook-scraper/commit/f4ea4af873d9907b236b816365d0f0a7d979437c

fashan7 commented 3 years ago

@neon-ninja what does "available":false mean?

neon-ninja commented 3 years ago

See https://github.com/kevinzg/facebook-scraper/pull/185
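
The PR is the authoritative explanation. Going only by the two dumps earlier in this thread, the logged-out scrape of 2358414964302425 had `available` false and no video, while the scrape with cookies had `available` True and a video URL. Under that assumption, a caller could separate posts that may need a logged-in retry. `split_by_availability` is a hypothetical helper, not part of facebook-scraper.

```python
from typing import Any, Dict, Iterable, List, Tuple

Post = Dict[str, Any]


def split_by_availability(posts: Iterable[Post]) -> Tuple[List[Post], List[Post]]:
    """Split posts into (available, unavailable) using the `available` key."""
    # Hypothetical helper; posts without the key are assumed to be available.
    available: List[Post] = []
    unavailable: List[Post] = []
    for post in posts:
        if post.get("available", True):
            available.append(post)
        else:
            unavailable.append(post)
    return available, unavailable
```

Posts that land in the second list are candidates for another `get_posts` call with `cookies` set, as in the earlier snippet.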

fashan7 commented 3 years ago

Thanks for the information.

fashan7 commented 3 years ago

@neon-ninja how can I learn this library (requests-html)? I'd like to learn it. How did you learn it?

neon-ninja commented 3 years ago

Just from the documentation - https://docs.python-requests.org/projects/requests-html/en/latest/

fashan7 commented 3 years ago

Is there any other resource for becoming an expert with this? facebook-scraper is advanced, right?

neon-ninja commented 3 years ago

Not really. What do you mean by advanced? As with anything, the way you become an expert is by having lots of practice: trying lots of different things, and writing and running lots of different bits of code. It probably helped that I was already very familiar with Python, requests, bs4, web scraping, HTML/CSS/JS, etc.