jewbmx / ScraperWork

Somewhere to toss scrapers for scrubs that need testing, checking, fixing, or trashing lol.

Scrapers I have checked and confirmed working.. #3

Open JayDee696969 opened 10 months ago

JayDee696969 commented 10 months ago

I went through all the scrapers and narrowed down the list to these. Ive checked them one at a time and can confirm these all work. (image: Screenshot_2)

jewbmx commented 10 months ago

Guess you're a bit off my thought wave like jayshomebrew was lol. When it comes to other folks' assistance on scraper work, i only need yall to either test the duds i put in this repo or maybe make some lmao. Besides that, it would be help with sites/urls i cant load on my end :)

For your list, every one of them listed has already been checked and updated by me over a week back, so you sorta wasted your own time. Although its still nice that ya wanted to contribute <3

JayDee696969 commented 10 months ago

> Guess you're a bit off my thought wave like jayshomebrew was lol. When it comes to other folks' assistance on scraper work, i only need yall to either test the duds i put in this repo or maybe make some lmao. Besides that, it would be help with sites/urls i cant load on my end :)
>
> For your list, every one of them listed has already been checked and updated by me over a week back, so you sorta wasted your own time. Although its still nice that ya wanted to contribute <3

Well, wasnt sure you knew since these seem to be the only ones that actually do work that are in your "Working" folder. I have no idea what youve checked and updated because the last update shows as 5 months ago. Makes for much faster scraping. I will test the duds on my end and see if any work for me.

jayshomebrew commented 10 months ago

@jewbmx Here's my first attempt at a scraper: 1movies_la. Shows that I tested: slow horses, the marvels, oppenheimer, reacher. Not great sources, but seems to work for me.

jewbmx commented 10 months ago

Nice looks like you did how i do and picked a good one that matches your desired style and pimped it out till it ran. My only note is that you can probably change the end removing the hoster bit and whatnot since you dont seem to be doing anything with it. Which also means you can likely reduce the final links/results code a little too thereby saving a wee bit of runtime lol. If you like it and it works i can toss it into the addon or you can keep it for yourself and delete your evidence lmao

jewbmx commented 10 months ago

That scraper/site is nice too👌 i like its results in both movies and shows lol. Brings back a bit of the variety of hosts that has been slowly going away

jayshomebrew commented 10 months ago

Thx for the feedback, as you can see, I'm just fumbling thru this trying to help anyway I can. yeah, if I can get a fix to you in time, please add it into the mix.

jewbmx commented 10 months ago

It looks like you have done well so far and are on the final steps of the scraper making process. Now all thats really left is to log a bunch of searches with like my url prep def's log line in order to see as many of the site's source results as possible in a list. Then go thru them to see if any are needing to be added to resolveurl for current and new resolvers, or to make them get scraped further in the scrape_sources module, or scraped further/special in the scrapers resolve def or sources def with maybe a direct source type item lol. Not sure if this all makes sense but i can likely clarify later if needed lol. Maybe i can make a helpful readme sometime on how i roll when making a scraper also. Regardless tho id do the resolve work on the scraper real quick, then if that one hoster code in the scraper still isnt being used id clean up the code a little, clearing out the little things that arent really being used.

jayshomebrew commented 10 months ago

soo, for example, go thru this log and compare to this and look for uniques?

jewbmx commented 10 months ago

Well i log a bunch of movie and show searches to make a list like in your log, then i toss that list thru this https://www.textfixer.com/tools/remove-duplicate-lines.php with the alphabetical order enabled, and then i toss it back into the log file and clean out any unnecessary content like error lines or the log messages so its just a list of the urls lol. After that i usually use a search-in-files for each domain to search resolveurl quickly and remove any that come up there, plus the ones i already scrape in my module.

But it looks like you went an even smarter route that i never thought of and logged the full host list lol.

When you run into ones that are not in the list tho thats when you open em in the browser to see if they are a new domain for a current resolver like a new doodstream domain, or if they are a new embed site that you need to scrape a little more to get to its sources, or if its just direct type of links that need resolved by hand in the addon, or sometimes there are just some that are junk spam type of shit and worthless to us which we just ignore and move past like we do with the random mistakes the scrapers make pulling tmdb links and shit that are not even sources just pics and whatnot lol.
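A minimal sketch of the dedupe, sort, and filter workflow described in this comment, done locally instead of through the web tool. The function names and the `known_domains` list are hypothetical stand-ins (for example, domains already found by searching resolveurl or the scrape_sources module); this is not code from the addon.

```python
from urllib.parse import urlparse


def unique_hosts(log_lines):
    """Return the sorted, de-duplicated hostnames found in logged lines."""
    hosts = set()
    for line in log_lines:
        line = line.strip()
        # Skip error lines, log messages, and blanks: keep only plain URLs.
        if not line.startswith(('http://', 'https://')):
            continue
        host = urlparse(line).hostname  # lowercased, no port; may be None
        if not host:
            continue
        if host.startswith('www.'):
            host = host[4:]
        hosts.add(host)
    return sorted(hosts)


def unknown_hosts(log_lines, known_domains):
    """Drop hosts that are already handled (hypothetical known list)."""
    known = {d.lower() for d in known_domains}
    return [h for h in unique_hosts(log_lines) if h not in known]
```

Whatever `unknown_hosts` returns is the short list worth opening in a browser to decide whether each one is a new domain for a current resolver, an embed page that needs more scraping, a direct link, or junk.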

jewbmx commented 10 months ago

Oh by the way if you contact gujal thru resolveurl on github or keybase you can simply share a link and info with him and he will likely help ya out about it, even if its a new resolver or garbage. He is a master of that area and helps me get thru some troubles at times lol. Its also a good way to share the new domains if you suck with the regex pattern updates or have poor faith in your abilities in that area lol

jewbmx commented 10 months ago

here is a version of your scraper that i modified a bit, its a lil more like how id do it more or less lol https://github.com/jewbmx/ScraperWork/tree/main/temp i didnt test it much tho and i also didnt do the host end of the scraper, im too high for all that lmao

jayshomebrew commented 10 months ago

something like this helpful?
https://02tvseries.cyou 'not found'
https://5190.svetacdn.in 'not found'
https://cdn.jwplayer.com 'not gonna work'
https://d0o0d.com 'doodstream.com'
https://enterp.online 'needs to scrape a little more', I think...
.
.

jayshomebrew commented 10 months ago

> here is a version of your scraper that i modified a bit, its a lil more like how id do it more or less lol https://github.com/jewbmx/ScraperWork/tree/main/temp i didnt test it much tho and i also didnt do the host end of the scraper, im too high for all that lmao

yeah, thats much better.

jewbmx commented 10 months ago

Well you wanna keep the full urls intact then open em as a normal page and source view then scour the code to try and find the sources. Doing a basic domain load is handy too tho when trying to spot current resolver sites new/alternative domains.

As for your current list shown they probably all have their own tricks to hiding content and could also need more stuff to gain access to them pages like cookies and referrals and whatnot.

An example is view-source:https://enterp.online/themes/pirate/js/player.min.js?v=1.3.1 located near the bottom of this link... view-source:https://enterp.online/embed/movie?tmdb=1892 Assuming this link example matches your real link result lol. The .js link sorta shows you the next steps needed and points out the details needed from the embed link. You can basically skip all them kinda results tho and just stick to the normal resolveurl hosts and continue finding the new domains, then there is also the random embed hosts i scrape further in the scrape_sources module that sometimes have new domains that work with them too. A good example of those is the random sources you see that are one domain but two versions of links like embedplus or some shit. The domain names dont come to mind tho so you would have to look at the module and try to spot any bits that might relate to your list.

We both could actually chill on the host work too and just consider that scraper done for now, since imma be doing the host work for all the active scrapers all at once this scrapers results will likely be in the mix as well. Tho i will probably still need you or someone else to do some url loading and page source code grabbing for the troublesome cockblocked sites my net flags lol.
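As a rough illustration of the "open the page source and scour the code" step from this comment, the sketch below pulls iframe and script `src` values out of saved HTML with a regex. The function name and the pattern are assumptions made for this example; real embed pages often need cookies, referers, or further requests, which this does not handle.

```python
import re


def find_embed_urls(html):
    """Return src values of <iframe> and <script> tags found in the HTML."""
    # Hypothetical pattern: tag name, any attributes, then a quoted src value.
    pattern = r'<(?:iframe|script)[^>]+src\s*=\s*["\']([^"\']+)["\']'
    return re.findall(pattern, html, flags=re.I)
```

Relative results such as a `.js` path still have to be joined with the page's domain before loading them.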

jayshomebrew commented 10 months ago

Yeah, I can do some page source saves, no problem. FWIW I appreciate all the insight into how this all works and all the time it takes to test everything. Scrubsv2 continues to be the best addon for fam and friends.

jayshomebrew commented 10 months ago

sources.py:594: timeout = '10' supposed to be timeout = 10 or pass? not really an issue, as this always is set via settings.xml.

jewbmx commented 10 months ago

Nice catch on that minor error/issue but its actually supposed to be rewritten as "_timeout =" so the "except:" could be written as "_timeout = timeout" since that def already has the timeout bit in its start lol. I forgot to make the change a while back and this makes me wonder how long its been that way lmao
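One way to read the fix described here, as a hedged sketch only: the real code at sources.py:594 is not shown in this thread, so the function name, the `setting_value` argument, and the default of 10 are all assumptions. The idea is that the def already receives `timeout`, so the `except` branch can fall back to it instead of a hard-coded string.

```python
def get_timeout(setting_value, timeout=10):
    """Return the timeout from settings, falling back to the def's own arg."""
    try:
        # Settings values typically arrive as strings, so convert to int.
        _timeout = int(setting_value)
    except (TypeError, ValueError):
        # Fall back to the timeout the function was called with.
        _timeout = timeout
    return _timeout
```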

jewbmx commented 10 months ago

@jayshomebrew I tossed a new scraper i was working on into that temp folder where the other version of your scraper is. If you get bored you can try to finish it for me lol, i stopped modifications at the return line and left some gaps between code written and code that is from the "template". The movie side is pretty much functional besides needing host type of work and them final lines of the code likely need some cleaning afterwards. As for the tv shows bit i hit a dead end where the next page is cock blocked for me by my net provider lmao. But basically i cant code anything else for it because everything left is cock blocked and the scraper is useless for me on my end, so its either someone makes it work for the addon or the whole thing gets thrown away and forgotten lol. Im already moving onto the next site in my url list so its whatever 🤪

jayshomebrew commented 10 months ago

the regex's are tough for me, but I'll give it a shot. The r_url for a tv show gave me 99 link results. r_url This was saved from ffox: 'https://epxmovies.net/tv/84958-s2'

.nevermind.

jewbmx commented 10 months ago

For regex patterns you can use something like this to craft and test em... https://regex101.com/r/6sjcce/1 That page there was tossed to me for the same reason back in the day by jsergio who used to be the resolveurl main dev.

I think i deleted my saved html bits for all the pages key points but if i recall the tv show side is coded differently and has like a set number of prefab spacers which are meant to match the appropriate season then episode, both coded a lil different. What i did was some url replace bits to bypass a little page loading and skip them odd processes. So right now it does the search page normally but then skips to the more final steps and closer to the real sources with less effort. For movies that final re.blah should be adding the sources/servers bit into the links list but i think most were scrape further type of links if not all of em. And as for tv shows that last re.blah and links list should show their odd episode prefab link list which i finished with another odd url trick checking for the appropriate season and episode bit s01e05 lol. So whats left is to do the next scrapePage line for that urls html then go from there. This scraper/site might not be worth the hassle tho if ya ponder it all lol. But its getting hard to find decent sites for tv show scrapers so idk, i probably made 15 new scrapers to use and i think there was only like 3 or 4 that had tv shows on their sites.
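The season and episode check mentioned above (the `s01e05` bit) could look roughly like this. The function names and the sample link shapes are hypothetical; the actual site's link format is not shown in this thread.

```python
def episode_tag(season, episode):
    """Build a zero-padded tag such as 's01e05'."""
    return 's%02de%02d' % (int(season), int(episode))


def links_for_episode(links, season, episode):
    """Keep only the links that mention the wanted season/episode tag."""
    tag = episode_tag(season, episode)
    return [link for link in links if tag in link.lower()]
```

The surviving links would then be the ones to feed into the next page scrape.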

jayshomebrew commented 10 months ago

I gave it a go, and got the tvshows to work. The problem is I get stuck near the end with the links that I can't seem to scrape. Some of these go to streambucket.net or smashstream that looks like this. Do I keep digging, or use that link? ie https://streambucket.net ?

jewbmx commented 10 months ago

Id probably just scrape what ya can and leave the others to be ignored. As long as you get some decent sources the others could be tinkered with later on. What i do is usually try to add a note that the scraper has other results that need more work. You could also try to do a search for their domain in github and see if someone has cracked em lol.

jewbmx commented 10 months ago

@jayshomebrew I uploaded some results from scrapers that i cant seem to load. If you wanna you can tinker with em or toss me some page info. I think the links in the file are all embed type of urls which will hold sources that resolveurl uses or urls to the video items lol. Also you might wanna know that ive done some changes to the scrape_sources module so a few of the links might work now if they fail elsewhere (like 2embed and whatnot that scrubs already processes). I eventually might need some testing done to ensure all them scrape_sources defs work properly too tho because most them sites wont work on my end so i cant tell if they stop functioning properly lol.

jayshomebrew commented 10 months ago

Take a look at this htmls.zip file for the results.

I tried my hand at a couple other scraper sites but got stuck at the end final links, for instance bflix_sx.

v2.vidsrc.me under the scrape_sources doesn't seem to work for me, as it looks like it wants vidsrc.xyz links now.

jewbmx commented 10 months ago

That bflix site should work with my primewire_mx scraper which might be one i made recently so i will upload it into the temp folder for you so you can try using it lol. I will also check out your zip, thanks for the help 😀

Added: Forgot to mention that the vidsrc stuff probably needs updated with the new domains added into all its bits, you can try updating it if ya want or toss me the html with the links saved above the code as a guide even tho i cant use em lol

jayshomebrew commented 10 months ago

primewire_mx script works, but the final ajax links (rabbitstream.net) never seem to work for me. Here is what the web page looks like: (image), and the html.

It appears that the ajax/get-link links are outdated based on the actual page links: Upcloud/vidcloud/voe/doodstream/mixdrop.

jewbmx commented 10 months ago

Yea rabbit and its sister site both seem to pop up a lot lately but are cock blocked like crazy so they are basically broken, but thats a resolveurl issue so id just disable that resolver for now to make life easier or keep my ghetto spam filter enabled to hide em. As for the site code i will do a few more tests with it and focus on shows a lil more to see if i made any errors or spot issues i didnt see before. But tossing that scraper on here was just to help you with a decent template since that bflix_sx site looks like it uses the same code style, altho it might need a couple tweaks if anything differs. Id start by a quick rename and swapping the domain in where its needed then doin a test run.

jayshomebrew commented 10 months ago

bflix_sx creates the same crummy links to rabbitstream.net, bummer.

jewbmx commented 10 months ago

Yup probably lol but still worth keeping around in case the sources ever change to something we can use. In case you're curious tho, the upcloud and vidcloud servers are usually for that rabbit doki crap, they are listed like that on a bunch of sites lol

jewbmx commented 10 months ago

Heres another simple site thats like a sister site of the 2embed.cc if you wanna try to make a scraper with it... soap2dayto.xyz it seems pretty simple and could even be made with hand crafted urls to the source page using base_link + '/embed/' + imdb_id instead of search page to movie to embed page. Might have to check the show side for differences tho, i didnt look that far since the next page after /embed/ is blocked on my end so i didnt bother any further lol
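A tiny sketch of the hand-crafted embed URL idea from this comment, assuming the movie path really is `/embed/` followed by the IMDb id. The function name and the id check are additions for illustration, and the show side may use a different layout that is not covered here.

```python
import re


def embed_url(base_link, imdb_id):
    """Build base_link + '/embed/' + imdb_id without doubling slashes."""
    # IMDb ids look like 'tt' followed by digits; reject anything else.
    if not re.fullmatch(r'tt\d+', imdb_id):
        raise ValueError('not an IMDb id: %r' % imdb_id)
    return base_link.rstrip('/') + '/embed/' + imdb_id
```

This skips the search page and the movie page entirely, which is the saving being described.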

jayshomebrew commented 10 months ago

> Yup probably lol but still worth keeping around in case the sources ever change to something we can use. In case you're curious tho, the upcloud and vidcloud servers are usually for that rabbit doki crap, they are listed like that on a bunch of sites lol

primewire_mx: Although the rabbit links don't work, the upcloud links do work in ffox. upcloud and vidcloud actual links here. I just don't know how to scrape primewire_mx for these actual links. Probably not worth it, but just wondering if you know another way.

> soap2dayto.xyz

ok, I'll check it out.
Update: yeah, this was easier for me, but still the site just ends up with embed URLs, so not really worth it. ie. vidsrc.xyz/vipembed/tv?imdb=tt8119642&season=1&episode=3 or 2embed.cc/movie/tt1517268

bjgood commented 8 months ago

upmovies_to had a domain change. It's now upmovies.net (from upmovies.to). Once corrected it works fine. Later...
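For anyone updating saved links after a domain change like this one, a small sketch of swapping only the host part of a URL. The function name and the sample path in the usage note are hypothetical; inside a scraper the usual fix is simply editing its domain and base link values.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_domain(url, old_domain, new_domain):
    """Replace the host of url when it matches old_domain; else return as-is."""
    parts = urlsplit(url)
    if parts.netloc.lower() == old_domain.lower():
        parts = parts._replace(netloc=new_domain)
    return urlunsplit(parts)
```

For example, `swap_domain('https://upmovies.to/search?q=x', 'upmovies.to', 'upmovies.net')` keeps the path and query and only changes the host.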

jewbmx commented 4 months ago

Imma post this same msg on each issue so anyone involved in the scraper stuff of the addon can see it lol... If any of yall have scraper/site stuff you want added into the addon, feel free to make a new comment in whichever issue ya wanna and include said stuff or if its files add them to my scraper work folder/repo thru a fork or whatnot and i will gather all the stuff up and go thru it during my update process that imma be beginning soon. I will wait until around july 1st to start so yall can have these next 10ish days to see my comment and do what ya do lol :)