lairdshaw / fups

FUPS: Forum user-post scraper
GNU Affero General Public License v3.0

end not detected #5

Closed TiloGit closed 5 years ago

TiloGit commented 5 years ago

I ran into an issue where the end of the forum is not properly detected, so the scrape keeps going indefinitely. The last page has 5 topics, but FUPS keeps requesting further pages and reporting:

Found 5 topics on page 5 in forum with ID "566".
Found 5 topics on page 6 in forum with ID "566".
Found 5 topics on page 7 in forum with ID "566".


# php /tiloBB/FUPS/fups-master/fups.php -i /tiloBB/optionsfileExample.txt -o /tiloBB/tiloOut/
0s Reading settings.
SETTINGS:
array (
  'forum_type' => 'phpBB',
  'base_url' => 'https://www.phpbb.com/community',
  'extract_user_id' => '',
  'extract_user' => '',
  'forum_ids' => '566',
  'login_user' => '',
  'login_password' => '',
  'start_from_date' => '2016-10-17 19:46',
  'php_timezone' => 'America/Los_Angeles',
  'download_images' => true,
  'non_us_date_format' => false,
  'debug' => true,
  'delay' => '5',
  'download_attachments' => false,
  'earliest' => 1476758760,
)
0s Finished reading settings.
$this->settings['forum_ids_arr'] == array (
  0 => '566',
)
Set cookie_filename to "/tiloBB/FUPS/workingDIR/_tiloBB_optionsfileExample.txt.cookies.txt".
In do_send(), retrieving URL <https://www.phpbb.com/community/ucp.php?mode=login>
Site title: phpBB &bull; User Control Panel &bull; Login
Entered progress level 7
Entered progress level 8
0s Attempting to scrape page 1 of forum with ID 566.
Waiting courteously for 5 seconds.
In do_send(), retrieving URL <https://www.phpbb.com/community/viewforum.php?f=566&start=0>
Found 23 topics on page 1 in forum with ID "566".
6s Attempting to scrape page 2 of forum with ID 566.
Waiting courteously for 5 seconds.
In do_send(), retrieving URL <https://www.phpbb.com/community/viewforum.php?f=566&start=23>
Found 25 topics on page 2 in forum with ID "566".
11s Attempting to scrape page 3 of forum with ID 566.
Waiting courteously for 5 seconds.
In do_send(), retrieving URL <https://www.phpbb.com/community/viewforum.php?f=566&start=48>
Found 7 topics on page 3 in forum with ID "566".
16s Attempting to scrape page 4 of forum with ID 566.
Waiting courteously for 5 seconds.
In do_send(), retrieving URL <https://www.phpbb.com/community/viewforum.php?f=566&start=55>
Found 5 topics on page 4 in forum with ID "566".
22s Attempting to scrape page 5 of forum with ID 566.
Waiting courteously for 5 seconds.
In do_send(), retrieving URL <https://www.phpbb.com/community/viewforum.php?f=566&start=60>
Found 5 topics on page 5 in forum with ID "566".
27s Attempting to scrape page 6 of forum with ID 566.
Waiting courteously for 5 seconds.
In do_send(), retrieving URL <https://www.phpbb.com/community/viewforum.php?f=566&start=65>
Found 5 topics on page 6 in forum with ID "566".
32s Attempting to scrape page 7 of forum with ID 566.
Waiting courteously for 5 seconds.
^C
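
In the log above, the `start` offset appears to advance by the number of topics found on the previous page (0, 23, 48, 55, 60, 65), and the forum keeps returning 5 topics however far the offset goes. One generic way to guard against that kind of loop is to stop as soon as a page contributes no topic IDs that have not been seen before. The following is a minimal sketch of that idea in Python, not FUPS's actual fix; `scrape_forum_topics`, `fake_fetch`, and the simulated forum are hypothetical names and data invented for illustration.

```python
from typing import Callable, List, Set

def scrape_forum_topics(fetch_page: Callable[[int], List[int]],
                        max_pages: int = 1000) -> List[int]:
    """Collect topic IDs page by page; stop when a page adds nothing new."""
    seen: Set[int] = set()
    ordered: List[int] = []
    start = 0
    for _ in range(max_pages):  # hard cap as a second safety net
        topic_ids = fetch_page(start)
        new_count = 0
        for topic_id in topic_ids:
            if topic_id not in seen:
                seen.add(topic_id)
                ordered.append(topic_id)
                new_count += 1
        if new_count == 0:
            break  # only repeats (or nothing) came back: treat as the end
        start += len(topic_ids)  # same offset arithmetic as in the log
    return ordered

# Hypothetical forum: 60 topics, 25 per page, and the last 5 topics are
# served again for any offset past the end (mimicking the repeated
# "Found 5 topics" lines).
TOPICS = list(range(1, 61))
calls: List[int] = []

def fake_fetch(start: int) -> List[int]:
    calls.append(start)
    page = TOPICS[start:start + 25]
    return page if page else TOPICS[-5:]

result = scrape_forum_topics(fake_fetch)
print(len(result), calls)
```

The `max_pages` cap is a design choice: even if a forum serves fresh-looking IDs forever, the loop still terminates.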

lairdshaw commented 5 years ago

Thanks for your report. Please let me know if the fix I've just committed doesn't solve the problem for you.

TiloGit commented 5 years ago

Hi, thanks for working on this, but the issue still exists. I see it on my target forum (phpBB 3.0.11-ish), but it works fine on this example (https://www.phpbb.com/community).

I'll send you a direct email with the target URL I'm trying, in case you want to look into it. Cheers.

lairdshaw commented 5 years ago

Have responded to your email - briefly, your target forum uses a custom skin (theme), and FUPS generally doesn't support custom skins. Have suggested a regex that might work anyway. Am happy to accept a pull request if you develop a working set of regexes for it.

SanZamoyski commented 2 months ago

Hi!

Can you suggest a regex for me and, most importantly, WHERE to put it?

forum_type=phpBB
base_url=https://www.t4-forum.pl
forum_ids=16
login_user=sun
login_password=changed_ofcorse...
start_from_date=2009-10-17 19:46
php_timezone=America/Los_Angeles
download_images=1
non_us_date_format=0
debug=1
delay=0

Best regards!

SanZamoyski commented 2 months ago

I think I got it. In generic_new, in classes/CphpBB.php I added: 'last_forum_page' => '(<strong>(\\d+)</strong> z <strong>\\1</strong>)Us',
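
To illustrate why this pattern picks out the last page: the outer parentheses act as PCRE delimiters, so the pattern itself is `<strong>(\d+)</strong> z <strong>\1</strong>`, and the backreference `\1` only matches when the current page number equals the total number of pages. Below is a minimal sketch in Python rather than FUPS code. The HTML snippets are invented for illustration, `is_last_page` is a hypothetical helper, and PHP's `U` (ungreedy) modifier is approximated with a lazy quantifier.

```python
import re

# Python translation of the PHP pattern above. PHP's "s" modifier maps to
# re.DOTALL; its "U" (ungreedy) modifier is approximated with a lazy "+?".
LAST_FORUM_PAGE = re.compile(
    r"<strong>(\d+?)</strong> z <strong>\1</strong>", re.DOTALL
)

def is_last_page(html: str) -> bool:
    """Return True if the pagination text reads "N z N" (page N of N)."""
    return LAST_FORUM_PAGE.search(html) is not None

# Invented pagination snippets in the Polish "X z Y" ("X of Y") style.
middle = "Strona <strong>3</strong> z <strong>7</strong>"
last = "Strona <strong>7</strong> z <strong>7</strong>"

print(is_last_page(middle))  # False
print(is_last_page(last))    # True
```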

lairdshaw commented 2 months ago

Yep, it looks like that would work. It's language-dependent, so we'd need to tweak it if you wanted it to be incorporated into the official version of FUPS, but if you just need a tailored solution for your personal use, then that's fine.
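
One possible direction for removing the language dependence, sketched here in Python purely as an assumption and not as anything FUPS does, is to allow arbitrary non-digit, non-tag text between the two numbers instead of the literal Polish " z ". The pattern name, the helper, and the sample strings are hypothetical, and such a loose pattern could produce false positives on themes that use similar markup elsewhere on the page.

```python
import re

# Hypothetical language-neutral variant: any run of characters that are
# not digits or angle brackets may separate the two page numbers.
LAST_PAGE_ANY_LANGUAGE = re.compile(
    r"<strong>(\d+)</strong>[^<>\d]*<strong>\1</strong>", re.DOTALL
)

def is_last_page_any_language(html: str) -> bool:
    """Return True if two adjacent <strong> numbers are identical."""
    return LAST_PAGE_ANY_LANGUAGE.search(html) is not None

# Invented samples in three languages.
samples = {
    "Strona <strong>7</strong> z <strong>7</strong>": True,
    "Page <strong>4</strong> of <strong>4</strong>": True,
    "Seite <strong>2</strong> von <strong>9</strong>": False,
}
for html, expected in samples.items():
    print(is_last_page_any_language(html) == expected)
```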