labster / taparip

Rip threads from a Tapatalk forum into a Sqlite3 database
MIT License

Command Line Error #1

Closed IFireflyl closed 1 year ago

IFireflyl commented 4 years ago

I'm seeing the following message when I run this:

Gathering data from https://tapatalk.com/groups/lionheartforums/
looking for thread t=6413&start=0
HTTP error: 301 Moved Permanently at taparip.pl line 122.
        main::download_thread(6413) called at taparip.pl line 99

The only thing I changed in the taparip.pl file was this:

# Site Configuration
#     http://domain.yuku.com/viewtopic.php?t=11571&start=0
my $domain         = 'tapatalk.com';
my $api_path       = 'groups/lionheartforums/';
# This is the path of where the database file will be created
# (If you use a relative path, you should keep running it from the same working directory)
my $db_file        = "LHF.sqlite";

Any thoughts on what I'm doing wrong?

labster commented 4 years ago

You might not be doing anything wrong. Let me take a look; it's quite possible they changed the nginx config on the Tapatalk side and my code is too dumb to handle it.

labster commented 4 years ago

Ah, I see the problem. You need to set your $api_path to 'groups/lionheartforums/viewtopic.php'.
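Judging by the log output, the request URL appears to be the domain, then $api_path, then the query string, which is why the path has to name viewtopic.php itself. A minimal sketch of that assembly, assuming that layout; thread_url is a hypothetical helper for illustration, not a function in taparip.pl:

```perl
use strict;
use warnings;

# Values from the configuration discussed above.
my $domain   = 'tapatalk.com';
my $api_path = 'groups/lionheartforums/viewtopic.php';

# Hypothetical helper: builds the URL for one page of a thread,
# following the pattern in the config comment (viewtopic.php?t=...&start=...).
sub thread_url {
    my ($thread_id, $start) = @_;
    return sprintf 'https://%s/%s?t=%d&start=%d',
        $domain, $api_path, $thread_id, $start // 0;
}

print thread_url(6413), "\n";
# https://tapatalk.com/groups/lionheartforums/viewtopic.php?t=6413&start=0
```

With the old value ('groups/lionheartforums/') the same assembly would request the group's landing page instead of a thread, which is consistent with the redirect in the error.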

Sorry it took so long for me to get back to you. Let me know if you have any more problems, and I'll try to be faster.

IFireflyl commented 4 years ago

You didn't take that long at all! I also had to change the $domain path to include the "www." at the beginning. Now I'm getting this message:

Gathering data from https://www.tapatalk.com/groups/lionheartforums/viewtopic.php
looking for thread t=635&start=0 - downloaded - School is almost out. Can't call method "child_nodes" on an undefined value at taparip.pl line 178.

IFireflyl commented 4 years ago

Any thoughts on this?

James-Gryphon commented 4 years ago

Similar to the other thread, if the script mentions an 'undefined value', there's a good chance (although not a perfect chance) that, in the page it read, it can't find the content it's looking for.

This is, I think, the relevant section of code:

# Tapatalk hides the post title in an link in a comment, WRYYYYYYYY . ('Y' x 40)
        my $titlecomment = $post->at('.postbody > div:first-child > h3:first-child')->child_nodes->first->content;
        my $posttitle = Mojo::DOM->new( $titlecomment )->at('a')->text;

The reference to the post title tipped me off. I looked for it in a present-day Tapatalk page, and couldn't find anything remotely like that anymore. My best guess is that Tapatalk has completely obliterated that section (and, of course, years and years' worth of people changing post titles). So, the script looks for it, doesn't find it, and crashes.
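The crash mechanism can be shown in isolation: Mojo::DOM's at() returns undef when the selector matches nothing, and calling a method on undef is fatal. Below is a rough sketch of a guarded version, assuming Mojolicious is installed; post_title is a hypothetical helper and the sample markup is made up to resemble a page without the title heading:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical post markup with no <h3> title, as on current pages.
my $post = Mojo::DOM->new('<div class="postbody"><div><p>Hello</p></div></div>');

# at() returns undef when nothing matches, so check before chaining.
sub post_title {
    my ($post) = @_;
    my $h3 = $post->at('.postbody > div:first-child > h3:first-child')
        or return '';
    my $first = $h3->child_nodes->first or return '';
    my $link  = Mojo::DOM->new( $first->content )->at('a') or return '';
    return $link->text;
}

print "title: '", post_title($post), "'\n";    # title: ''
```

This falls back to an empty title when the heading is absent, while still picking the title up on any page that happens to have it.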

My hack, which gets the script past that section (to another point where it inevitably fails):

        # Tapatalk hides the post title in an link in a comment, WRYYYYYYYY . ('Y' x 40)
#        my $titlecomment = $post->at('.postbody > div:first-child > h3:first-child')->child_nodes->first->content;
# May be a moot point now -- it looks like it's been totally removed!
#        my $posttitle = Mojo::DOM->new( $titlecomment )->at('a')->text;
        my $posttitle = '';

labster commented 4 years ago

Oh damn, I forgot all about this. Hi all. Thanks for the investigation, @James-Gryphon. Can you tell me where your new failure is?

James-Gryphon commented 4 years ago

I wrote out a long step-by-step description of the changes I made, then apparently accidentally visited another page and lost all of them... I can't say I'm very happy with that, since it provided a lot of relevant detail.

To provide a short summary, I figured that many of the problems were being caused by things that Tapatalk removed or changed since then, so I went through either providing placeholders or finding equivalents in the current page. Here's a link to a gist I put up of the current script.

The problem at this point is that it isn't saving the posts. The terminal log reads something like this (real names/etc. hidden):

james@... perl taparip.pl
Logging in with user: Admin, session_id: #####
Login successful
Gathering data from https://www.tapatalk.com/groups/generic_board//viewtopic.php
looking for thread t=1996&start=0 - downloaded - Hi
39486
0
39678
1
2 saved
looking for thread t=1999&start=0 - downloaded - All data has been carried over
39681
0
39682
1
39683
2
39684
3
39685
4
5 saved

...but in the actual DB file, although it creates the threads and users properly, the posts table is completely empty.

I may have broken something when I was experimenting with $pid. I wasn't confident it would work, but pre-change it kept throwing these errors:

Use of uninitialized value in substr at taparip.pl line 170.
substr outside of string at taparip.pl line 170.
Can't call method "text" on an undefined value at taparip.pl line 171.

...so I figured that I'd try something and see if I could get past that, hence how it got to where it is now in the Gist.

I'm a complete Perl novice, so I don't expect you'll be very impressed with the status of the script after I hacked at it, but hopefully it can be a stepping stone to a current solution.

labster commented 4 years ago

You're doing OK at editing so far... Keep in mind that most of this is using Mojo::DOM, which is documented here: https://metacpan.org/pod/Mojo::DOM

In your gist I see this...

        my $pid = substr( $post->tag('div')->attr('id'), 2);

I'm not sure what you're trying to do there, but you're taking the element in $post and setting that element to be a <div> tag, which is probably not what you want. We're just extracting data, not trying to modify the DOM. You possibly want at or find? (The substr( ..., 2) just means throw out the first two characters, presumably some prefix in front of the numeric post id.)
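For reference, here is a small sketch of what substr does with a single offset, plus a more forgiving alternative. The id values are made up for illustration (the real prefix on Tapatalk pages may differ), and numeric_id is a hypothetical helper, not something in taparip.pl:

```perl
use strict;
use warnings;

# Hypothetical id value; the real prefix on Tapatalk pages may differ.
my $id = 'p_39486';

# substr with only an offset returns everything from that offset on,
# so offset 2 discards exactly two leading characters.
my $pid = substr $id, 2;
print "$pid\n";    # 39486

# A regex that grabs the trailing digits does not depend on the prefix
# length, and returns nothing (no warning) when the attribute is missing.
sub numeric_id {
    my ($attr) = @_;
    return unless defined $attr;
    my ($digits) = $attr =~ /(\d+)\z/;
    return $digits;
}

print numeric_id('pid39486'), "\n";    # 39486
```

Checking for an undefined attribute first is what avoids the "Use of uninitialized value in substr" and "substr outside of string" warnings quoted earlier.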

And the undefined value... it looks like the user's postcount isn't even there. I pulled up a Tapatalk page, and it looks like I could find it in jQuery using $('.user-statistics:not(.popup) span span') -- can you try that selector instead?

James-Gryphon commented 4 years ago

So the $pid is bad - I thought it probably would be. I think the hope there was that tag was some kind of tag-specific selection mechanism. Looking at it again, and using your original line, it looks like it basically works, so I'm not sure why I felt the need to try to change or replace it. I might've gotten it confused with a problem later on, I suppose.

Is $count referring to the user's post count, or to the post's count (that is, its position in the topic)? I assumed the latter; it looks like $post_count is set separately (near the discontinued '$rank'), and in the context of looking at the post, it seems to be a good fit for there.

labster commented 4 years ago

Oh, sorry, you're correct, it's the current post's count in the thread. Which should always exist under .author a span. If the value is undefined, I wonder if they're inserting an advertisement there or something? With uBlock Origin I never really know what the web is like for everyone else. I think you can just do return; to skip processing that post and exit the anonymous subroutine in each() if you determine the row is bad.
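Something like the following is what I mean by returning early. It is only a sketch with made-up markup (the .post and .ad class names are assumptions, not taken from a real Tapatalk page), and it needs Mojolicious installed:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup: the second row has no ".author a span" element.
my $dom = Mojo::DOM->new(<<'HTML');
<div class="post"><div class="author"><a href="#p1"><span>1</span></a></div></div>
<div class="post"><div class="ad">Sponsored</div></div>
<div class="post"><div class="author"><a href="#p3"><span>2</span></a></div></div>
HTML

my @counts;
$dom->find('.post')->each(sub {
    my ($post) = @_;
    my $span = $post->at('.author a span');
    return unless $span;    # bad row: leave this callback, each() moves on
    push @counts, $span->text;
});

print "kept: @counts\n";    # kept: 1 2
```

The return only exits the anonymous subroutine for that one element, so the remaining posts in the thread are still processed.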

James-Gryphon commented 4 years ago

It looks like .author a span should do the trick.

I found that apparently the timestamp parsing is broken (no thanks to Tapatalk). The database requires a 'not null' value for that slot; after testing it by turning that requirement off, it seems that's the cause of post saving being broken.

It also looks like user signup dates are somewhat broken; it does capture the month and day, but not the year, and not the hour and minute. Tapatalk's preferred method these days is to hide all of that in a title attribute, with an irregular format, at that. If it can be made to work for one, though, it should work everywhere else date-time extraction and parsing is needed.
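As a rough illustration of the kind of parsing involved, here is a sketch using the core Time::Piece module. The date format is purely an assumption for the example (the real title attribute may look different, so $format would need adjusting), and parse_timestamp is a hypothetical helper:

```perl
use strict;
use warnings;
use Time::Piece;

# ASSUMPTION: the title attribute looks like "Feb 03, 2020 4:15 PM".
# The real Tapatalk format may differ; adjust $format to match.
my $format = '%b %d, %Y %I:%M %p';

# Returns epoch seconds, or undef when the string does not parse,
# so the caller can skip the row instead of violating NOT NULL.
sub parse_timestamp {
    my ($title) = @_;
    return unless defined $title;
    my $t = eval { Time::Piece->strptime($title, $format) };
    return defined $t ? $t->epoch : undef;
}

my $epoch = parse_timestamp('Feb 03, 2020 4:15 PM');
print defined $epoch ? gmtime($epoch)->datetime : 'unparsed', "\n";
```

Note that strptime treats the parsed time as UTC, so any time zone offset the forum applies would have to be handled separately.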

So, if the timestamp reading and generating is fixed, it looks like the script should be set up to work for at least early Feb 2020 (until Tapatalk breaks things again).

James-Gryphon commented 4 years ago

Thought I'd bump this to note that after a few man-hours of novice Edisonian-style labor, I've finally put together what appears to be a working, updated version of the script. Have a look at the new revision of the Gist. If it tests well and you don't see any glaring problems with it, it'd be nice to see these changes get pushed to the main branch, and we can finally lay these issues to rest.

labster commented 4 years ago

Looks nice except for your aversion to indentation. I guess you're not a Python programmer. Why don't you make a pull request?

I much enjoyed your AJAX adventures.

Just out of curiosity, do you know why there are more people interested in this script now? Did Tapatalk buy someone new? My forum all bailed about two months after we got moved from Yuku, and never looked back. Which is why this software isn't maintained much by me -- that, and it had to have an Edisonian design in the first place because it's all workarounds to how they desecrated phpBB.

James-Gryphon commented 4 years ago

Most of my labor day to day is in PHP... I guess you know what we're like. ;) As for the pull, I'll get right on it.

The edit history thing they do now is more powerful (unusually for Tapatalk, the same company who removed manual time zone settings), but depends completely on JavaScript. It was a blessing to find that interface was exposed.

A little while after taking over Yuku, Tapatalk acquired InvisionFree and then ZetaBoards, and gradually converted all of them, in a less-than-upfront manner. There was a bit of a scramble to escape when ZetaBoards was still open (enabled by nneonneo's excellent crawler), but the deadline for that came and went.

Now time's gone by, Tapatalk's artificial limitations are getting more and more obvious and obnoxious, and I think it's fair to say there's a lot of people who would leave if they knew they could, but they don't know any way to bring their accumulated data out. If the previous exodus was like rats fleeing a sinking ship, now it's something like trying to rescue passengers trapped in a submarine. It seems like this is the sort of situation GDPR should have actually been able to help with, but I guess those things never work in a way where they'd be useful.

My personal experience: a friend of mine knows at least one major site that left the Tapatalk system cold turkey, I suppose without being aware of their conversion options, and was curious if I could help them now that their old stuff has been fully Tapatalked. I have a little experience merging forums and converting them to other platforms, but I'm still a bit more of a hacker (in the sense of Dr. Frankenstein) and a script kiddie than a true programmer, so this script's existence was very important to me as a starting point. I expect the story is similar for others.

IFireflyl commented 4 years ago

@James-Gryphon your script is working. I did eventually encounter an issue with it.

looking for thread t=466&start=0 - downloaded - Can't call method "text" on an undefined value at taparip.pl line 159.

As someone who doesn't do much with code, I have no clue what would have caused this error. Plenty of other threads downloaded without issue.

James-Gryphon commented 4 years ago

I don't think I did anything with line 159 (because it never tossed that error when I tried it on my test forum), but we can definitely have a look and figure out what's awry with it.

my $forumid = $dom->find('#nav-breadcrumbs .crumb:last-child')->last->attr('data-forum-id');

I hate to violate our comfortable anonymity, but would it be possible to send a link to your forum, so I can take a look at t=466? If you don't want that, I can try to walk you through the inspection process, or do blind testing on my own, but this might be quickest. (If you do, you should probably email me or something instead of putting it up here; I doubt Tapatalk has people sitting around watching this page, but even so, "the walls have ears", and although I think we're morally on good ground, what's right and what's done don't always overlap.)

UPDATE: I think I might see where the problem lies. That topic wouldn't happen to be on a child board (a board inside another board), would it?

James-Gryphon commented 4 years ago

Looking at it again, it looks like line 159 is different from what I was sure it was. I suppose I read it from the original code, which does have that line at 159, instead of my modified versions. Times like this I feel like I'm losing my mind.

At any rate, another test forum is showing trouble, tripping up at another part of the import, so I suppose it's back to the drawing board. The script mods I put up might work for some cases (like they did for my original test board), but I can't guarantee they will help everyone yet.

You're still welcome to get in touch with me about the details of that particular topic, though. Finding one like it on another test forum might be like looking for a needle in a haystack, so looking directly at it is still likely to be quickest.

James-Gryphon commented 4 years ago

Ran the script a little while and got the error to come up. It looks like it could be that the topic is on a staff/restricted forum. The original script was intended to keep track of unauthorized threads, but I suppose that's something else that was broken. I guess I didn't notice it earlier because I typically ran the script as a user with admin rights. A patched version should be coming right up at my fork. Try it out and hopefully this time it will work for you; if not, let me know and we'll keep working at it.

labster commented 1 year ago

I'm not really sure where we are on this one, but I merged #3 and perhaps it resolved the original issue? If you're still having problems, or even still care, feel free to reopen the issue or create a new one.