kevinzg / facebook-scraper

Scrape Facebook public pages without an API key
MIT License

Translated text in posts with `get_posts` method #670

Open kostyakoz opened 2 years ago

kostyakoz commented 2 years ago

Hello! I'm getting posts using the get_posts method, and when the original post is written in a non-English language, post_text includes a translated version right after the original text.

Is there any way to get only the original text, without the translation? Maybe the translated version is wrapped in some extra HTML tags?

neon-ninja commented 2 years ago

Can you share a link to an example post that causes this problem?

kostyakoz commented 2 years ago

It appears with every non-English post actually (Russian, Italian, Polish etc.):

https://facebook.com/262108804192078/posts/1461142234288723
https://facebook.com/262108804192078/posts/1461123174290629

Just a few examples.

Screenshots

Original post written in Russian:

Screenshot 2022-02-12 at 15 12 03

post_text result of it:

Screenshot 2022-02-12 at 15 11 49

neon-ninja commented 2 years ago

You can disable translations in your Facebook account settings (https://www.facebook.com/settings?tab=language) - does that work for you?

kostyakoz commented 2 years ago

Unfortunately, I'm using many pre-made accounts with cookies, so I can't change the settings for each one. That's why I'm trying to extract the original text with the scraper alone.

neon-ninja commented 2 years ago

I see. This commit (https://github.com/kevinzg/facebook-scraper/commit/a38acef545019c3cf8a8f31c443c50ce81f41158) adds a new key, original_text, to the post result dict. Here's what https://facebook.com/262108804192078/posts/1461142234288723 looks like when extracted using the code as of that commit:

{'available': True,
 'comments': 0,
 'comments_full': None,
 'factcheck': None,
 'image': 'https://dtf.ru/cover/fb/c/1076271/1644663329/cover.jpg',
 'image_id': None,
 'image_ids': [],
 'image_lowquality': 'https://external.fakl8-1.fna.fbcdn.net/safe_image.php?d=AQEQ-tUWi-2nRq0u&w=476&h=249&url=https%3A%2F%2Fdtf.ru%2Fcover%2Ffb%2Fc%2F1076271%2F1644663329%2Fcover.jpg&cfs=1&jq=75&ext=jpg&_nc_oe=6f924&_nc_sid=06c271&ccb=3-5&gt=1&_nc_hash=AQFZsR7vCOfH1mDf',
 'images': ['https://dtf.ru/cover/fb/c/1076271/1644663329/cover.jpg'],
 'images_description': [],
 'images_lowquality': ['https://external.fakl8-1.fna.fbcdn.net/safe_image.php?d=AQEQ-tUWi-2nRq0u&w=476&h=249&url=https%3A%2F%2Fdtf.ru%2Fcover%2Ffb%2Fc%2F1076271%2F1644663329%2Fcover.jpg&cfs=1&jq=75&ext=jpg&_nc_oe=6f924&_nc_sid=06c271&ccb=3-5&gt=1&_nc_hash=AQFZsR7vCOfH1mDf'],
 'images_lowquality_description': [None],
 'is_live': False,
 'likes': 0,
 'link': 'https://dtf.ru/anime/1076271?fbclid=IwAR1H654cd88f3e5JFqlspnLJkE8x-\\-\\Vi3XfHLXvaKwOzj4hLxj1GxnRbHEo',
 'links': [],
 'original_request_url': 'https://facebook.com/262108804192078/posts/1461142234288723',
 'original_text': 'Студия MAPPA официально анонсировала второй сезон аниме '
                  '«Магическая битва», чей полнометражный приквел стал самым '
                  'кассовым фильмом в японском прокате.\n'
                  '\n'
                  'Продолжение сериала о школьнике Юдзи выйдет в начале 2023 '
                  'года.',
 'page_id': '262108804192078',
 'post_id': 1461142234288723,
 'post_text': 'Студия MAPPA официально анонсировала второй сезон аниме '
              '«Магическая битва», чей полнометражный приквел стал самым '
              'кассовым фильмом в японском прокате.\n'
              '\n'
              'Продолжение сериала о школьнике Юдзи выйдет в начале 2023 '
              'года.\n'
              '\n'
              'MAPPA has officially announced the second season of the anime '
              '"Magic Battle", whose full-fledged prequel became the most '
              'cashmere film in Japan.\n'
              '\n'
              'The continuation of the series about a schoolgirl Yuji will be '
              'released in early 2023.',
 'post_url': 'https://facebook.com/story.php?story_fbid=1461142234288723&id=262108804192078',
 'reaction_count': None,
 'reactions': None,
 'reactors': None,
 'shared_post_id': None,
 'shared_post_url': None,
 'shared_text': 'DTF.RU\n'
                'Студия MAPPA анонсировала второй сезон «Магической битвы» — '
                'он выйдет в 2023 году — Аниме на DTF',
 'shared_time': None,
 'shared_user_id': None,
 'shared_username': None,
 'sharers': None,
 'shares': 0,
 'text': 'Студия MAPPA официально анонсировала второй сезон аниме «Магическая '
         'битва», чей полнометражный приквел стал самым кассовым фильмом в '
         'японском прокате.\n'
         '\n'
         'Продолжение сериала о школьнике Юдзи выйдет в начале 2023 года.\n'
         '\n'
         'MAPPA has officially announced the second season of the anime "Magic '
         'Battle", whose full-fledged prequel became the most cashmere film in '
         'Japan.\n'
         '\n'
         'The continuation of the series about a schoolgirl Yuji will be '
         'released in early 2023.\n'
         '\n'
         'DTF.RU\n'
         'Студия MAPPA анонсировала второй сезон «Магической битвы» — он '
         'выйдет в 2023 году — Аниме на DTF',
 'time': datetime.datetime(2022, 2, 13, 0, 2, 22),
 'timestamp': 1644663742,
 'user_id': '262108804192078',
 'user_url': 'https://facebook.com/playdtf/?__tn__=C-R',
 'username': 'DTF',
 'video': None,
 'video_duration_seconds': None,
 'video_height': None,
 'video_id': None,
 'video_quality': None,
 'video_size_MB': None,
 'video_thumbnail': None,
 'video_watches': None,
 'video_width': None,
 'w3_fb_url': None,
 'was_live': False,
 'with': None}

Please give it a try and check it works ok for you.
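
For reference, here is a rough sketch of how the new key could be consumed. It assumes, as in the sample output above, that post_text starts with original_text and that any translation follows it. The helper names split_translation and iter_original_texts are hypothetical and not part of facebook-scraper; only get_posts and its post_urls and cookies arguments come from the library.

```python
from typing import Iterator, Optional, Tuple


def split_translation(post: dict) -> Tuple[str, Optional[str]]:
    """Return (original, translation) for one post dict (hypothetical helper).

    Assumes post_text begins with original_text, followed by the
    translation, as in the sample output in this thread.
    """
    full = post.get("post_text") or ""
    original = post.get("original_text")
    if not original:
        # The key may be absent or empty; fall back to the full text.
        return full, None
    if full.startswith(original):
        # Whatever follows the original is treated as the translation.
        translation = full[len(original):].strip()
        return original, translation or None
    return original, None


def iter_original_texts(post_urls, cookies=None) -> Iterator[str]:
    """Yield only the original text for each post URL (hypothetical helper)."""
    # Imported lazily so split_translation works without the library installed.
    from facebook_scraper import get_posts

    kwargs = {"post_urls": post_urls}
    if cookies is not None:
        kwargs["cookies"] = cookies
    for post in get_posts(**kwargs):
        yield split_translation(post)[0]
```

Falling back to post_text when original_text is missing keeps the same code usable with results that lack the key.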

kostyakoz commented 2 years ago

This is working perfectly. Thank you so much for the help! 👍

kostyakoz commented 2 years ago

@neon-ninja hello again! I've noticed that for long text posts, the original text includes an ellipsis and a "... More" label. You can check it with this post: https://facebook.com/262108804192078/posts/1463319394071007

Screenshot 2022-02-15 at 19 00 36

neon-ninja commented 2 years ago

I see - try https://github.com/kevinzg/facebook-scraper/commit/527eabafe980347a51da3e9e557cf6a752d75a2a
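
If you're stuck on a version from before that commit, post-processing the text is a possible workaround. This is a minimal sketch, assuming the label shows up as a trailing "... More" (or "… More" with a single ellipsis character) at the end of the text. strip_more_label is a hypothetical helper, not part of facebook-scraper.

```python
import re

# Matches a trailing "... More" or "… More" label, with optional whitespace.
_MORE_LABEL = re.compile(r"\s*(?:\.\.\.|…)\s*More\s*$")


def strip_more_label(text: str) -> str:
    """Remove a trailing '... More' label if present (hypothetical helper)."""
    return _MORE_LABEL.sub("", text)
```

Note that this only removes the label. If the text itself was cut off at that point, the missing part can't be recovered this way.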

Moiz-khan commented 2 years ago

Use google_trans_new after scraping the data.