Mahelita opened this issue 3 years ago

Hi community,
has anyone already written a scraper to extract the track names of Tonies? E.g., from https://tonies.de/shop/tonies/die-drei-fragezeichen/der-super-papagei-limited/#tabs--large-up__titelliste. I am looking into this as Gambrius' JSON (http://gt-blog.de/JSON/tonies.json) is missing nearly all tracks and I would like teddy to include the track names in the .ogg metadata, to then add them to the backup filenames (see issue #29 for the latter). If no one has done it already, I will try to do it myself and share the code here.
Thanks!

Some first quick results in Python 3.x:
import requests
from bs4 import BeautifulSoup

url = 'https://tonies.de/shop/tonies/die-drei-fragezeichen/der-super-papagei-limited/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
titlelist = soup.find(id="tabs--large-up__titelliste")
# materialize the children first; extracting while iterating over the live
# .children generator skips every other element
children = [a.extract() for a in list(titlelist.children)]
# drop the heading element, keep one entry per track
titlelist = [a.get_text() for a in children[1:]]
print(titlelist)
['01 - Ein Hilferuf', '02 - Ein Papagei spricht Latein', '03 - Schneewittchen ist verschwunden', '04 - Ein unverhoffter Besuch', '05 - Blackbeard, der Pirat', '06 - Die rätselhafte Botschaft', '07 - Von Steinen und Gebeinen', '08 - Blackbeard hat das letzte Wort']
So getting the track names is super easy, provided the tonie has tracks at all, which is not always the case. But getting the urls matching the JSON entries might be a bit more difficult.
Next update: I can now get nearly all urls to the pages containing the track names of German tonies. Some urls do not follow the logic I implemented. Fortunately the exceptions are few and could be handled explicitly (list of failing urls below, followed by a possible override sketch).
from bs4 import BeautifulSoup
import numpy as np
import re
import requests

url = 'http://gt-blog.de/JSON/tonies.json'
data = requests.get(url).json()

url_base = 'https://tonies.de/'
# strip or transliterate characters that do not appear in the shop urls
special_char_map = {ord('!'): '', ord('?'): '', ord('’'): '', ord('&'): '', ord('.'): '', ord(','): '', ord(' '): '-', ord('ä'): 'ae', ord('ü'): 'ue', ord('ö'): 'oe', ord('ß'): 'ss'}

series_names_url = []
episode_names = []
episode_names_url = []
series_urls = []
for tonie in data:
    if tonie['language'] == 'de':
        series_names_url.append(re.sub('--+', '-', tonie['series'].translate(special_char_map).lower()))
        episode_names.append(tonie['episodes'])
        episode_names_url.append(re.sub('--+', '-', episode_names[-1].translate(special_char_map).lower()))
        series_urls.append('{base}shop/tonies/{series}/'.format(base=url_base, series=series_names_url[-1]))
series_urls = np.unique(series_urls)

all_episode_urls = []
# note: series_names_url and episode_names are per-tonie lists, so they
# must not be zipped with the deduplicated series_urls
for url in series_urls:
    r = requests.get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, 'html.parser')
        href_all = soup.find_all('a', href=True)
        request_urls = [href['href'] for href in href_all]
        # keep only hrefs containing the 'shop/tonies/<series>/' part of the url
        matches = [a.find(url[len(url_base):]) != -1 for a in request_urls]
        all_episode_urls.append(np.unique(np.array(request_urls)[matches]))
    else:
        print(url)
https://tonies.de/shop/tonies/der-raeuber-hotzenplotz/
https://tonies.de/shop/tonies/die-maus/
https://tonies.de/shop/tonies/heule-eule-und-andere-geschichten/
https://tonies.de/shop/tonies/kosmo-klax/
https://tonies.de/shop/tonies/kreativ-tonie/
https://tonies.de/shop/tonies/nola-note/
https://tonies.de/shop/tonies/rotzn-roll-radio/
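One possible way to handle these few non-conforming series explicitly, as mentioned above - a sketch only, since the correct target urls have to be looked up by hand (the dict values here are placeholders, not verified urls):

# manual overrides for series whose shop urls do not follow the generated
# pattern; fill in the verified urls by hand
URL_OVERRIDES = {
    # 'die-maus': '<verified shop url>',
}

def series_url(slug):
    # fall back to the generated pattern when no override is known
    return URL_OVERRIDES.get(slug, 'https://tonies.de/shop/tonies/{}/'.format(slug))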
Now I need to get the track name scraper working again (stupid AttributeError: 'NavigableString' object has no attribute 'get_text').
But I already saw that there will be some difficult to handle exceptions where track names are given only for blocks of tracks.
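For reference, that AttributeError comes from bare text nodes: .children yields NavigableString objects between tags, and those have no get_text(). A minimal guard, reusing the titlelist variable from the first snippet:

from bs4 import Tag

# NavigableString children (whitespace, stray text) have no get_text(),
# so keep only real tags
tracks = [a.get_text() for a in titlelist.children if isinstance(a, Tag)]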
Hello Mahelita, good job! That is the missing part for my JSON file. The track feature is already included in my JSON structure and in TeddyBench, but until now I had no time to look up all the track names. I wanted to write a scraper myself in PHP, but for lack of time the track names are mostly missing.
Could you write the scraper in PHP? I would include it in my tonie DB structure so the track names end up in the JSON in the future.
Just get in touch with me. You can find me in the RevvoX chat as well (I don't know what your nick is in there...).
Regards, Gambrius
Update to anyone wondering what the status is: Currently spending my me-time working on another scraper project.
@Gambrius
Just get in touch with me
Done ;-)
Update: I tried to finalize the python version of the scraper, but at the moment I am unable to match the tracks to the tonies in the json file... This is unfortunately due to discrepancies between the episode names in the json and the episode urls. I will look into fuzzy string matching. Hopefully sometime soon.
Update: using fuzzy matching I can correctly assign most urls to the tonies in the json. There is still a rather large number of errors, and I need to determine a fuzzy matching threshold that must be exceeded for a match to be accepted.
Update: I could not resist... A threshold > 64 has perfect specificity at the cost of sensitivity, i.e., no mismatches but fewer true matches.
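For illustration, a minimal sketch of such a threshold gate, assuming a thefuzz/fuzzywuzzy-style scorer on a 0-100 scale (the actual matching lives in the gist linked further down):

from thefuzz import fuzz  # fuzzywuzzy exposes the same fuzz.ratio

THRESHOLD = 64  # scores above this are accepted as matches

def best_match(episode_name, candidate_urls):
    if not candidate_urls:
        return None
    # score each candidate url slug against the episode name from the json
    scored = [(fuzz.ratio(episode_name.lower(), u.rstrip('/').rsplit('/', 1)[-1]), u)
              for u in candidate_urls]
    score, match = max(scored)
    return match if score > THRESHOLD else None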
I think the source url should also be saved in the json in the future, so the fuzzy matching would only be needed for the initial mapping. Great solution! I see there is a double slash in the url (//shop). This isn't a problem because it works, but it may become one in the future.
Great work @Mahelita
Extracting the track names will be tricky, as there are cases like "01 bis 04" ("01 to 04"), etc. I think removing the numbering / moving it into its own attribute is a great idea; one possible split is sketched below.
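A hypothetical way to split the numbering off, covering both plain numbers ("01 - Ein Hilferuf") and "bis" ranges; the regex and the attribute names are assumptions, not part of any existing schema:

import re

# matches "01 - Title" as well as range forms like "01 bis 04 - Title"
TRACK_RE = re.compile(r'^(\d+(?:\s+bis\s+\d+)?)\s*-\s*(.+)$')

def split_track(name):
    m = TRACK_RE.match(name)
    if m:
        return {'no': m.group(1), 'title': m.group(2)}
    return {'no': None, 'title': name}  # no leading numbering found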
Update: During the last coding evening I noticed some strange fuzzy matches for specific tonies. It turns out I was too tired to code properly. Thankfully I was not too tired to spot strange behavior. With fixed code I only have a few tonies which get wrong matches. Again, this is strange as these look very easy to match. So I'll be hunting for the next bug during the next coding evening! If I cannot find any I'll just handle these few separately...
I also noticed that some tonie shop pages format the track section without using a <p></p> for each track name, which at the moment escapes my scraper. Here is an example: https://tonies.de/shop/tonies/lieblings-klassiker/robinson-crusoe-und-vier-weitere-klassiker/
Update: I found the (last?) bug! While matching json and scraped tonies I was overwriting good matches with less good matches which did not have a correspondence. Now that I verify that a new match has a better score than the existing match, it looks like there are no longer any mistakes (see the sketch below). @Gambrius This also shows that http://gt-blog.de/JSON/tonies.json is no longer up-to-date. Possibly an update is in order? Well, this part of the track name scraper took me quite some time. Let's hope that it will be future proof! (I fear not -_-) Unfortunately we are not done yet... I am aware of the following issues that still need to be handled:
But first I will clean up my code a bit and post it. It has been a while...
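The overwrite fix described above boils down to only accepting a candidate when it beats the stored score. A minimal sketch with assumed data shapes (the real code is in the gist below):

# matches: url -> (score, tonie); only overwrite when the new score is
# strictly higher, so a good match can no longer be clobbered by a worse one
def assign(matches, url, score, tonie):
    current = matches.get(url)
    if current is None or score > current[0]:
        matches[url] = (score, tonie)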
No code cleaning done yet, but still sharing the code, as the cleaning will probably only happen when I find time to tackle the next item in the above todo list. https://gist.github.com/Mahelita/a6a934071f926a944d57ad0c6c99852d
There is no such thing as a "last bug" ;) I appreciate your attempt, but found
As an alternative, I'm now looking into tonies.club/tonie/all - which might provide better information. A first check provided no "ranges" (e.g. for "Was ist was" episodes which are notorious for subsections). I haven't spotted other vital info (like hash) yet, so this may be an exercise only.
I had to abandon my "exercise", and return to the original idea of scraping tonies.com instead.
Please see https://gist.github.com/steve8x8/db659463c5f86a1649f2a21c4aacc4b4 for the first public take - en-US tonies seem to be handled differently, and there are indications of typos in the original json. (I kept some of the original code from @Mahelita as comments.)
The result is still somewhat incomplete, as series queries sometimes do not return the same information that would be visible in the browser (see de-de/tonies/anne-kaffeekanne for an example). TBH I haven't yet fully understood how the fuzzy matching works. Still, this code (and the resulting json) might be a starting point for adding TRACK information in Teddy and TeddyBench? (#29)
NB: I found it useful to pass the resulting json file through json_pp for better readability. Caveat: this adds about 1/6 in file size ;)
2023-01-12 versions, with correct track lists (no more ranges):
scraper-eu.py (for officially listed de-de, en-gb, fr-fr tonies): https://gist.github.com/steve8x8/743ffdbed914b2c47ed038672698a34d
scraper-us.py (for officially listed en-US tonies): https://gist.github.com/steve8x8/35a39eb27d7db9ca245d9c30d5563ac8
Both scripts output tonie counts, total and new (also, to stderr, a list of new tonie titles). They need orig.tonies.json to fill in several fields which are/seem otherwise inaccessible. They may throw a list of yet unhandled Unicode representations if more of them are added (suggestions how to clean up the cleanJson() function are welcome). ${lang}_tonies.raw.json should be passed through json_pp for readability, e.g.:
json_pp < de-de_tonies.raw.json > de-de_tonies.json
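Not knowing the internals of cleanJson(), one possible direction for taming newly appearing Unicode representations is generic NFKD normalization instead of a hand-kept replacement table - a sketch only:

import unicodedata

def clean_text(s):
    # decompose accented characters, then drop the combining marks;
    # anything without an ASCII decomposition is kept unchanged
    nfkd = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in nfkd if not unicodedata.combining(c))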
@Gambrius (and/or @SciLor ?) #29 now has all it needs for filling TRACK information, I think.
Final update for this week: https://gist.github.com/steve8x8/f84228c76debda1e7a4a54835e7378e7
There may be an issue with tonies that can be assigned new episodes ("Benjamin Blümchen" for example): they will keep their model but change audioID and hash - and tracks :-(
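A hypothetical merge guard for such re-releases, assuming each record carries model, audioID, hash, and tracks fields (the exact key names in the json may differ):

def refresh_rereleases(known_by_model, scraped):
    # same model but a different hash means the tonie was re-released:
    # audioID, hash and tracks must then be refreshed
    for tonie in scraped:
        old = known_by_model.get(tonie['model'])
        if old is None:
            known_by_model[tonie['model']] = tonie
        elif old['hash'] != tonie['hash']:
            for key in ('audioID', 'hash', 'tracks'):
                old[key] = tonie[key]
    return known_by_model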
Okay, I obviously "forgot" to include essential information in the previous versions (thanks to @Gambrius for motivating me).
There was some time to add stuff from tonies.club, which keeps some records of abandoned releases, and to look into the US and HK servers (where I failed to access creative-tonies lists...).
The code, I hope, is legible enough to be modified to
Here you go.
Time for yet another update. I also made a scraper for tonies.club, which has some additional info - if a file tc-tonies.json is available it will be used. (Note that both scrapers produce .raw.json files which have to be renamed or beautified with json_pp.)
Here you go. Feedback is welcome.
I get a 404. Why don't you put it into a repo? So your changes are visible =)
https://github.com/steve8x8/tonies-scraper - now with some helper scripts using Teddy.exe under mono and opus2tonie