Open RavanJAltaie opened 1 month ago
Recipe created at https://farm.openzim.org/recipes/shamela.ws_ar_all ; I'll update the library link once ready. I already sent them an email to double-check whether any of the books in the library are under copyright.
We've received an answer from the team: all the books on the website are more than 100 years old. In 20 years of operation they've received only 2 copyright claims on books, and they deleted those books immediately, as per their website policy.
Does that mean it won't get crawled?! I have requested a ZIM file related to this website here [#986]. I think it's public domain. Could this be made?
@hamoudak no no, this means we're good. You can follow the task on the link given above.
Thank you, it's a valuable website for reading and studying. Sorry, I was confused; so their website policy is to be free.
After 3 days, crawler progress is 3% (100753 / 2859505). 2.8 million links to explore is way too much. I cancelled the task and disabled the recipe. We need to find another way of ZIMing this website, this is not feasible with zimit, at least as-is.
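For reference, a rough extrapolation of the figures quoted above. This is only a sketch: it assumes the crawl rate stays constant and that the crawler discovers no further links, neither of which is confirmed in this thread.

```python
# Figures quoted in the comment above.
explored_links = 100_753
total_links = 2_859_505
elapsed_days = 3

progress = explored_links / total_links

# Assumption: constant crawl rate and a fixed total number of links.
estimated_total_days = elapsed_days / progress

print(f"progress: {progress:.1%}")
print(f"estimated total duration: {estimated_total_days:.0f} days")
```

Under those assumptions, a full crawl would take on the order of months rather than days.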
@benoit74 could my request [#986] be created? It's one of five archives related to this domain. Or do I have to wait for some reason?
Also, I would suggest that the library keep being scraped with Zimit, but divided into 40 categories as it is on the website, or divided on your side.
The idea of dividing the ZIM per category as on the website is a good one.
And looking a bit more into it, I don't get why we ended up with 2M links.
Anyway, I've started a first sub-recipe of category 34: https://farm.openzim.org/recipes/shamela.ws_ar_34
In this new recipe, ZIM name, title and description are very bad, this will have to be fixed, but at least let's see how it goes.
I can give you the names of these categories in Arabic; I know Arabic very well. This category is called [ al-shir-wa-dawawinu], poetry diwans. Arabic: الشعر ودواوينه
Why it ended up with millions of pages: because some books have many volumes, like dictionaries, literature, explanations and interpretations of the Quran. In addition, there is a category called sunna books (category/6); most of its books have many links because each page may hold just one or two lines of text from a big book, so it eventually ends up with many links.
OK, so what I did for [ al-shir-wa-dawawinu] poetry diwans is not going to work for all categories. I basically asked to explore only links listed in the category page, so if I understand you well, it will explore only the books but not their volumes.
If I get you correctly, what we would like is tell the scraper to:
For instance, for https://shamela.ws/category/4, we want the book https://shamela.ws/book/23622 but also https://shamela.ws/book/23622/1, https://shamela.ws/book/23622/2 and so on, and also https://shamela.ws/author/263 ; and so on for all other books of the category.
Is this correct? Do we have other links / pages which would be needed in each ZIM per category?
All that being said, I don't know yet how to do it with zimit, but at least it is important to understand what we would like to achieve ^^
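To make the target scope concrete, here is a minimal sketch of a URL filter matching the examples above. It is not zimit configuration: `in_scope` and the two patterns are hypothetical names, and accepting every author page (rather than only authors of the category's books) is a simplifying assumption.

```python
import re

# Hypothetical patterns derived from the example URLs in the comment above.
BOOK_RE = re.compile(r"^https://shamela\.ws/book/(\d+)(?:/\d+)?/?$")
AUTHOR_RE = re.compile(r"^https://shamela\.ws/author/\d+/?$")


def in_scope(url: str, category_book_ids: set[int]) -> bool:
    """Return True if the URL should go into the ZIM for this category."""
    match = BOOK_RE.match(url)
    if match:
        # Keep the book page and its numbered sub-pages, but only for
        # books that belong to the category.
        return int(match.group(1)) in category_book_ids
    # Assumption: any author page is accepted.
    return AUTHOR_RE.match(url) is not None


category_4_books = {23622}  # would hold every book ID listed in the category
print(in_scope("https://shamela.ws/book/23622/2", category_4_books))
print(in_scope("https://shamela.ws/book/13174", category_4_books))
```

The second call is rejected because book 13174 is not in the example set.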
I can give you the names of these categories in Arabic; I know Arabic very well. This category is called [ al-shir-wa-dawawinu], poetry diwans. Arabic: الشعر ودواوينه
Glad you can help on this, thanks a lot. Once we have a working plan, I will come back to you about what we need precisely.
First, it will only explore the links but not their sub-pages [the books themselves are volumes]. And you are absolutely right on all three points you gave with examples. I have made over 200 (highly important) books of this domain with youzimit. When I did a basic crawl, I got just the titles (the contents of the book), not the sub-pages. So I went to the custom scope and gave it the right parameters for the sub-page links. Then I got it to work.
I think I've managed to build a pretty good ZIM of category 34. You can see a preview at https://dev.library.kiwix.org/#lang=&q=34 (this is not the final URL, and it is never guaranteed to work, this is just the dev server).
I'm currently running again the recipe to update the icon (which is blank) and to update the CSS of HTML pages inside the ZIM (to hide useless things when offline). What do you think?
My main concern is that it took 7 hours, which is not that bad given the scraper had to explore 7966 links, which gives us an average of 3 secs per link, but this was for only 25 books. I don't know what this will mean for a huge category like category 6 with 1227 books and very huge ones like https://shamela.ws/book/13174.
For the new task which is currently processing, I've increased the number of parallel workers to 4; let's hope it will not trigger something bad on the upstream server.
I will also retrieve all main book pages to count the number of links per category and get an estimate of total time per ZIM.
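The averages mentioned above can be reproduced from the quoted figures. A small sketch, using only the numbers of this first category 34 run:

```python
# Figures from the category 34 run described above.
task_hours = 7
explored_links = 7966
books = 25

seconds_per_link = task_hours * 3600 / explored_links
links_per_book = explored_links / books

print(f"{seconds_per_link:.2f} s per link")
print(f"{links_per_book:.0f} links per book on average")
```

The per-book average is only indicative: books with many volumes will have far more links than a poetry diwan.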
Do you have any idea of how often we should update the ZIM? E.g. how often are they adding new books, or how many books are they adding per month / quarter / year?
Since it looks like we will finally have a plan, it is now time to ask for your help regarding ZIM metadata.
For every ZIM (and hence category for now), we will need:
selection: this is what will go into the ZIM name, which will be named shamela.ws_ar_<selection> (without the < and >). Since we are doing one ZIM per category, the selection should be more or less the category. It should be as short as possible, but also as expressive as possible. It can contain only alphanumeric characters and the dash. If I understand you well, I imagine that for category 34 it should be 34-al-shir-wa-dawawinu (I'm not sure adding the category number will help ... it would help us to maintain the ZIM at least ^^)

title: this is the ZIM title, displayed in all readers. It is limited to 30 characters. It should help to identify which ZIM the user is going to open. It must be in the ZIM native language, so Arabic, and user-friendly (i.e. it is not used by machines). If it was in English, I would for instance consider using "shamela.ws books: category 34", since it is difficult to be more expressive in only 30 characters. I don't really like it to be honest, it is a bit ugly, but at the same time I don't see how to fit more in 30 chars. Maybe "shamela poetry diwans"? Not sure it will be possible for all categories, and it is less precise than the first alternative I proposed.

description: this is the ZIM description, displayed in all readers. It is limited to 80 characters. It should help understand what is inside the ZIM, as a complement to the title. It must be in the ZIM native language, so Arabic, and user-friendly (i.e. it is not used by machines). If it was in English and I understand you well, I would probably use something like "Books of shamela.ws collection, category 34 of poetry diwans" (maybe adding something about what these books are would be interesting: are they about art, religion, daily life, law, mechanics, technology, ...? I haven't understood this so far)

Could you propose something for category 34 first? Please do not hesitate to ask friends for feedback as well on these, it is hard work and often good ideas might come from interactions with others.
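The three constraints above can be checked mechanically before a recipe is configured. The following is a minimal sketch: `check_metadata` is a hypothetical helper, not part of any openZIM tool, and it counts Unicode code points with `len()`, which may differ from how ZIM tooling counts characters.

```python
import re

# Limits as stated in the comment above.
SELECTION_RE = re.compile(r"^[A-Za-z0-9-]+$")
MAX_TITLE_CHARS = 30
MAX_DESCRIPTION_CHARS = 80


def check_metadata(selection: str, title: str, description: str) -> list[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if not SELECTION_RE.match(selection):
        problems.append("selection may only contain alphanumerics and dashes")
    # Assumption: one code point counts as one character.
    if len(title) > MAX_TITLE_CHARS:
        problems.append(f"title is {len(title)} chars, limit is {MAX_TITLE_CHARS}")
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description is {len(description)} chars, limit is {MAX_DESCRIPTION_CHARS}"
        )
    return problems


print(check_metadata(
    "34-al-shir-wa-dawawinu",
    "shamela.ws books: category 34",
    "Books of shamela.ws collection, category 34 of poetry diwans",
))
```

The English examples from the comment pass all three checks; spaces are included in the counts.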
You actually made it very good and complete. I downloaded it from the farm before you posted this comment. Everything works as intended. As for the things you'll hide, I don't know much about it, but the ZIM file as it is shown on the website is good enough.
I have noticed, having been on the website for a long time, that they add only a few books: about 7 books in six weeks. Their uploading of new books is random; one time I saw just two new books after nearly two months. They don't add much, but they work hard to add new ones.
Now I think it will be better if you leave the categories without numbers; their names would define them. I'm going to work on the "metadata" tomorrow; I'll update you with anything I see. It's my pleasure to help you as much as I can.
Note: I find the first ZIM better than the updated one; it's not vital like the original, and I hope to keep it as it was for offline use.
I have noticed, having been on the website for a long time, that they add only a few books: about 7 books in six weeks. Their uploading of new books is random; one time I saw just two new books after nearly two months. They don't add much, but they work hard to add new ones.
OK, thank you. Updating the ZIM once per semester is hence probably a bare minimum. But I would like to avoid creating multiple ZIMs in parallel so as not to overwhelm their server, so I doubt we can update once per quarter, or it would mean almost always having a ZIM update running.
Now I think it will be better if you leave the categories without numbers; their names would define them. I'm going to work on the "metadata" tomorrow; I'll update you with anything I see. It's my pleasure to help you as much as I can.
Thank you !
Note: I find the first ZIM better than the updated one; it's not vital like the original, and I hope to keep it as it was for offline use.
I'm not sure I get you here. Do you mean that the links I hid in the latest version made the ZIM worse? I'm very interested in your feedback on this. We have always considered that it is better to hide as many "broken" external links as possible, since they usually do not work in an offline scenario, and so far we've considered that they only bring frustration.
I will also retrieve all main books page to count the number of links per category and have an estimate of total time per ZIM.
On this estimate, as expected category 6 is the biggest one with approximately 1.1 million links. Given the speed of last task with 4 parallel workers, it should complete in about 10 days which is OK.
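As a sanity check of this estimate, the crawl rate implied by "1.1 million links in about 10 days" can be compared with the rate of the first category 34 run (7966 links in 7 hours). This is a sketch under the assumption that the first run used a single worker, which the thread does not state explicitly.

```python
# Figures quoted in the thread.
category_6_links = 1_100_000
estimated_days = 10

# Rate implied by the 10-day estimate, in links per second.
implied_rate = category_6_links / (estimated_days * 24 * 3600)

# Rate of the first category 34 run (assumed to be single-worker).
first_run_rate = 7966 / (7 * 3600)

speedup = implied_rate / first_run_rate
print(f"implied rate: {implied_rate:.2f} links/s")
print(f"first run rate: {first_run_rate:.2f} links/s")
print(f"ratio: {speedup:.1f}x")
```

The ratio comes out close to 4, which is consistent with the 4 parallel workers if throughput scales roughly linearly.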
That's good news about making cat 6. And yes, I find the latest changes on cat 34 not suitable for reading books; the white theme makes it uncomfortable on the eyes for reading anything. I do know that you make these changes for a better offline experience (like for stackoverflow and others), but this domain is simple and needs no changes, especially to styles or themes. I would suggest that you keep both the original and the customized one, but for the latter keep the styles and themes; you can remove only the external links (if you must change something). Overall, the website has few external links, not enough to warrant changes; besides, it's Arabic, which is a different culture.
Sorry, I hadn't read carefully that you need the title and description in Arabic. I was talking to my family at the time. I made most of the plan.
Edit: I'm working on the descriptions.
And yes, I find the latest changes on cat 34 not suitable for reading books; the white theme makes it uncomfortable on the eyes for reading anything
I did not intentionally change anything regarding themes, I just removed external links. And I always had a white theme, with a toggle (which is still there) to enable the dark theme. So I don't get your point. Can you share a screenshot of the previous ZIM and the new one, or small videos?
There's a simple issue (related to theme and links) in the old ZIM you created: when I'm browsing, it shows the links and themes as they are online, then when I go forward or back while browsing a book they disappear completely! This is shamela.ws_ar_34_2024-10 (size 24.4MG), page 80.
Page 81 for the same file.
As for the theme you removed, it was the light blue one, and you're right, there is no difference between the two ZIMs. So it's the light blue; could you keep it, please :)
I'm sorry but I still don't get it. What do you want to keep? The light blue header on the top of the ZIM? Isn't this just a zone with a link to contribute to shamela.ws, which is not going to work offline? Do you mean that I should keep the blue zone and just remove the link instead? (I feel like this blue zone was useless if empty, but I can easily add it back if you feel like it is useful even if empty)
Yes, you got my point; it will relieve tired eyes a little when you read for a long time. You'll see a big difference.
Edit: there are coloured texts, I think it will fit with them.
I'm almost done with the descriptions. I have also translated all category names into English; it was my first time translating religious idioms, so it took a lot of time to be sure of the translated text, besides reading and searching for related articles for context.
Edit: there are coloured texts, I think it will fit with them.
do you mean the text in yellow leading to a contribute page? I can keep it as well, no strong opinion other than this link will be outside the ZIM, needing an online connection.
No, I didn't mean that; I meant the texts within the book, which are coloured with many colours for study and remembering. So the light blue will fit with those coloured words. Remove that yellow.
I think I have finished the metadata file; you can now have a look and tell me what you think, or if anything should be edited, etc.
Edit: this is an updated one; you can work on this:
There was a simple typo on dawawin: in Arabic it's دواوين, not دوواوين. Is that OK?
No, I didn't mean that; I meant the texts within the book, which are coloured with many colours for study and remembering. So the light blue will fit with those coloured words. Remove that yellow.
OK, I just relaunched category 34 with that change (and fixed ZIM name, title and description). Note that the recipe is now at https://farm.openzim.org/recipes/shamela.ws_ar_alshir-wa-dawawinu for consistency with the ZIM name.
I think I have finished the metadata file; you can now have a look and tell me what you think, or if anything should be edited, etc.
Thanks a lot for this metadata file! It looks OK at first sight. I might have a few more questions as I dive into details category per category, but it is sufficient to get me started.
I think with the light blue it has become very good. Please work on the updated metadata above; I fixed some typos.
Do you copy book IDs manually for custom scraping? Or the URLs included in a category? If so, I can help you with the categories you need them for.
Just one remark: when you go to the author page, you can't read the author biography completely; that's happening in the ZIM file.
@Popolechien : ZIM for category 34 is ready to move to prod: https://dev.library.kiwix.org/#q=%D8%A7%D9%84%D9%85%D8%AC%D9%85%D9%88%D8%B9%D8%A9 ; can you please have a look before I publish it?
@benoit74 Not 100% sure, but I suspect there is a formatting issue with that black bar over here: https://dev.library.kiwix.org/viewer#shamela.ws_ar_alshir-wa-dawawinu_2024-10/shamela.ws/author/3009
The top link also redirects to shamela.ws, which is blocked as it is considered an external link. Any chance we can have it redirect to the ZIM's home page?
@Popolechien that will be good, to be redirected to the home page; then you can go to the category from there. Before this ZIM goes to prod, one little thing: there is this sign ( ; ) for separating sentences, which is for English, not Arabic. In Arabic it must be ( ؛ ). I've edited them all here
So this ZIM's metadata is:
the Arabic in English letters: alshir-wa-dawawinu
title: دواوين الشعر؛ المجموعة رقم 34
description: دواوين الشعر العربي في الجاهلية وصدر الإسلام، وبعض الشروحات عليها
Sorry to mention this, but now everything is complete from my side.
Edit: I have finished the category 4 URLs for custom scope: category_4.txt
OK, sorry for the broken layout, I broke things with the last CSS change. I've fixed it and am running the recipe again to update the ZIM ATM.
The top link also redirects to shamela.ws, which is blocked as it is considered an external link. Any chance we can have it redirect to the ZIM's home page?
Nope
i've edited them all here
OK, I'm using your last file now.
Edit: I have finished the category 4 URLs for custom scope: category_4.txt
Thank you, but I have already built the list of all books per category with a script, so there is no need to do that manually for other categories.
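For illustration, such a list could be built along these lines. This is not the script actually used here: it is a standard-library sketch that assumes book links appear on the category page as plain `<a href=".../book/<id>">` anchors, and the sample HTML is made up for the example.

```python
import re
from html.parser import HTMLParser

# Assumed link shape: absolute or site-relative /book/<id> URLs.
BOOK_HREF_RE = re.compile(r"^(?:https://shamela\.ws)?/book/(\d+)/?$")


class BookLinkCollector(HTMLParser):
    """Collect distinct book IDs from anchor tags, in page order."""

    def __init__(self) -> None:
        super().__init__()
        self.book_ids: list[int] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = BOOK_HREF_RE.match(href)
        if match and int(match.group(1)) not in self.book_ids:
            self.book_ids.append(int(match.group(1)))


def book_urls_from_category_html(html: str) -> list[str]:
    collector = BookLinkCollector()
    collector.feed(html)
    return [f"https://shamela.ws/book/{book_id}" for book_id in collector.book_ids]


# Made-up sample markup, only to exercise the function.
sample = (
    '<a href="https://shamela.ws/book/23622">a</a>'
    '<a href="/book/13174">b</a>'
    '<a href="https://shamela.ws/author/263">c</a>'
    '<a href="/book/23622">duplicate</a>'
)
print(book_urls_from_category_html(sample))
```

Fetching the category pages themselves is left out on purpose, to keep the sketch independent of the live site.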
:) That's cool. I've checked the ZIM and everything is working fine, but only on https://dev.library.kiwix.org/viewer#shamela.ws_ar_alshir-wa-dawawinu_2024-10 ; on kiwix js pwa there is the same problem with the author biography. I think the first ZIM you created, before removing external links or editing the CSS, was working with no issues.
It's working fine now on "kiwix js pwa". Thank you.
Thanks, the new ZIM indeed seems OK now. @Popolechien any more remarks, or shall I move this ZIM to prod?
Can you do a bit of CSS to remove/hide the search box?
Never mind the box works.
Other than that, and if @hamoudak has no other comment, I think we'll be good to go. Thank you @benoit74
I think the "authors index" needs a ZIM file of its own, without going into the book contents: a roadmap to the library. For this ZIM, everything is good to go.
In config:
title | العقيدة؛ المجموعة رق 1
correction:
title | العقيدة؛ المجموعة رقم 1
رقم (not رق ) means: number. You wrote it the way you would shorten it in English with: num
I want to recommend something. You should pay attention to punctuation marks in ZIM descriptions; they are used to clarify the meaning in any written language. If you look at the description of the ZIM file "telmidetice", for instance, it's complicated and difficult for the reader to read.
description: التعليم عن بعد المستوى الأول القراءة تصفية الصعوبات القرائية ب ت ث ن ي
correction: التعليم عن بعد، المستوي الأول. القراءة - تصفية الصعوبات القرائية ب ت ث ن ي
Especially in the Arabic language.
If you ever want to change the English numbers into Arabic ones in the titles, here it is: shamela_metadata_ar_numbers_2024-10.txt
Thank you... I just copy-pasted, so I don't get how I managed to make it wrong ...
Don't mention it, it happens. Maybe your mind was busy with something else. What matters is that your work ends up perfect.
I just realized that some titles (category 3 and 5 at least) are too long. They should fit within 30 chars. @hamoudak can you please suggest fixes?
Do spaces count? Because I think the Arabic title of cat 3 is 30 chars and cat 5 is 29 chars (without spaces).
Edit: I went to the charactercalculator website, and they do indeed exceed 30 characters (with spaces). You mean the Arabic titles! If so, for cat 3 you can take the short word التفسير
title: التفسير؛ المجموعة رقم 3
As for cat 5, you can use التجويد
title: التجويد؛ المجموعة رقم 5
When you get to cat 6, rename it to: al-sunna. If you build "alshir-wa-dawawinu" once again, put a hyphen after (al): "al-shir". Not a big issue if you don't, though. If you find anything else that doesn't fit, please tell me, or I can fix them all if spaces are counted.
EDIT:
I have made fixes and suggestions for all titles, in case [spaces] count too.
Yes, spaces count. And I'm speaking about the arabic titles and descriptions. The "name" in english is not really limited (but the shorter the better obviously since it will be placed into ZIM filename)
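To see what "spaces count" means in practice, here is a small sketch measuring the shortened category 3 title proposed above, with and without spaces. It counts Unicode code points with `len()`, which is an assumption about how the 30-character limit is measured.

```python
# Shortened category 3 title proposed earlier in the thread.
title = "التفسير؛ المجموعة رقم 3"

with_spaces = len(title)
without_spaces = len(title.replace(" ", ""))

print(f"with spaces: {with_spaces}, without spaces: {without_spaces}")
print("fits in 30 chars:", with_spaces <= 30)
```

Since the limit applies to the count including spaces, that is the number to check before submitting a title.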
You are right. I think the descriptions in the text file are OK, unless you just want the long ones to be better. I can do that. They haven't exceeded the permitted chars.
I have made the descriptions a little shorter here; you can work with this if you prefer: fixes-and-suggestions-descriptions.txt
Why is رق (the word for number) still there in https://dev.library.kiwix.org/#lang=&q=%D8%A7%D9%84%D8%B9%D9%82%D9%8A%D8%AF%D8%A9 ? It is not updated as you did in its config. And by the way, I can point out many Arabic translation mistakes or typos across the ZIM files on lib.kiwix, such as in the description of مدرسه (madrasa): there is a wrong Arabic word, فيدوات ; it should be فيديوهات, the plural of the word "video". And many others related to the descriptions of Arabic wikipedia.
Why is رق (the word for number) still there in https://dev.library.kiwix.org/#lang=&q=%D8%A7%D9%84%D8%B9%D9%82%D9%8A%D8%AF%D8%A9 ? It is not updated as you did in its config.
This is because the task was already running with the "bad" description, as can be seen at https://farm.openzim.org/pipeline/c10047a3-8243-4e4b-b603-519d75d42d4c/debug, when I fixed it. I didn't want to interrupt it since it takes time to complete. I will run it again with the updated config; it is going to be fast, since I've configured the task to keep an intermediate archive to rebuild the ZIM quickly.
Thank you for your attention, much appreciated!
And by the way, I can point out many Arabic translation mistakes or typos across the ZIM files on lib.kiwix, such as in the description of مدرسه (madrasa): there is a wrong Arabic word, فيدوات ; it should be فيديوهات, the plural of the word "video". And many others related to the descriptions of Arabic wikipedia.
To fix this, you can open a dedicated issue "fix typo in arabic words" and precisely list everything you can. Unfortunately most of us at kiwix are not Arabic writers / speakers at all, and we somehow trusted the person who worked with us not to input typos. We should probably double or triple check this.
Yes, I know most of you are from Switzerland; it's hard to know Arabic well. I graduated from the faculty of languages and translation a long time ago, but I was affected by some troubles. I do my best though, and I am able to revise texts.
Should I open it in zim-requests?
Should I open it in zim-requests?
Yes please
I think their server is temporarily down for maintenance. Could you make my book request until it comes back? The website is stable compared to the other one.
It came back. Pages in the ZIM file for "ulum-al-quran" are not working, due to this server issue. You should start the scraper from the very beginning to get all pages. And I hope your next ZIM started after the server was fixed.
shamela.ws_ar_al-tafsir-3 failed while it was close to completion. It ran way slower than expected, I will have to check all this before starting again a task.
I noticed that the minute it was done, and I wanted to ask you why it failed. You may run a smaller cat now until you check this one. It seems that maybe you ran more tasks besides this one, or maybe that has nothing to do with it!
Why is the scroll bar on the left, while it's on the right side on the website and in the desktop app version?