go-shiori / shiori

Simple bookmark manager built with Go
MIT License

Importing Netscape style bookmarks.html file - What gets in and what gets left out #125

Open m040601 opened 5 years ago

m040601 commented 5 years ago

First of all, thank you for your work on this wonderful project. I'm testing Shiori by importing, querying and caching articles from big, very big bookmarks.html files. I have a question and some feedback about the import (and export) process in general, not just for big files. Maybe going into a little more detail about this issue is useful for other users with similar questions as well. Pardon me if this is covered somewhere else, but I could not find a complete answer by searching the issues and docs.

Why is it important

The files I'm testing with were produced by exporting bookmarks from the old delicious.com bookmarking service. Exporting your Pinboard bookmarks, your Firefox bookmarks, or even your Wallabag/Pocket or similar lists of bookmarked articles should produce the same style of HTML file and raise the same questions.

These files are in the very old Netscape style bookmark format. They still seem to be working fine after all these years, despite the criticism and the modern JSON and places.sqlite alternatives. I'm not sure how standardized they are, and apologies if I'm saying something wrong.

But I think a decisive feature for Shiori, especially for new users, is quickly getting your "meat", your data, in and out from other places, so you can try out Shiori with the data you already have, be it from other browsers, databases, or online services. Understanding well how these files work and what does or does not get imported/exported is critical, so that nobody is surprised when a bit of information they had somewhere is lost. After all, these files contain all the "work" you put into bookmarking and cataloging things.

Remember the increase in mobile usage over the last years, with so many apps, services, cloud storage and proprietary data silos such as Facebook, Pocket, etc. For many, a bookmarks file is a thing they have never heard about. I experience this in teaching: many students can't even use a desktop browser for anything other than clicking links on a web page and subscribing to an online service. The idea that you can be responsible for managing and storing your collection (bookmarks) privately and offline, with minimal effort, is alien.

Tags and Folders are different schemes, but Firefox lets you use both at the same time

Remember the difference between folders and tags:

The typical Netscape style bookmarks.html file

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<!-- This is an automatically generated file.
It will be read and overwritten.
Do Not Edit! -->
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
...
... lots and lots of bookmark entries
...
</DL><p>

Those DL tags can also be repeated and nested in more complicated structures or "categories". That's what you get if you also bookmark links in Firefox in folders and subfolders, creating a tree:

main-folder
  animals-folder
    - page1-bookmark-qute-dogs-famous-page-on-the-internet
    - page2-bookmark-qute-cats-famous-page-on-the-internet
    - page3-bookmark-qute-bunny-famous-page-on-the-internet
  news-folder
    new-york-times-folder
      - article-4-bookmark-trump-won-the-elections
      - article-5-bookmark-new-iphone-autumn-2019
    bbc-folder
      - article-6-bookmark-queen-visit-japan
      - article-7-bookmark-car-accident-last-friday

This is why Shiori asks you "do you want to generate tag from category". The "category" is the name of the folder the bookmark was in. So:

  1. If you hadn't used tags in Firefox bookmarks and only had folders, you end up with each bookmark (article or page) "tagged" with only one tag: the name of the XYZ-folder it was in.

  2. If you had organized your Firefox bookmarks with a mix of folders AND tags, then (correct me if I'm wrong) Shiori picks up your Firefox tags and additionally adds an extra tag with the name of the folder.
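To make the "category" idea concrete, here is a minimal Go sketch of how the chain of enclosing folders could be recovered for each bookmark. It is not Shiori's importer. It assumes the usual export layout in which a folder is written as a `<DT><H3>` line followed by its own `<DL><p>` block, with one element per line. The names `foldersByURL` and `sampleDoc` and the example.com URLs are made up for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// sampleDoc is a made-up export with two levels of folders.
const sampleDoc = `<DL><p>
    <DT><H3>animals-folder</H3>
    <DL><p>
        <DT><A HREF="https://example.com/dogs">dogs</A>
    </DL><p>
    <DT><H3>news-folder</H3>
    <DL><p>
        <DT><H3>bbc-folder</H3>
        <DL><p>
            <DT><A HREF="https://example.com/queen">queen</A>
        </DL><p>
    </DL><p>
</DL><p>`

var (
	folderRe = regexp.MustCompile(`(?i)<DT><H3[^>]*>([^<]*)</H3>`)
	linkRe   = regexp.MustCompile(`(?i)<DT><A\s[^>]*HREF="([^"]*)"`)
)

// foldersByURL maps each bookmark URL to the names of the folders that
// enclose it, outermost first. It reads the file line by line.
func foldersByURL(doc string) map[string][]string {
	result := make(map[string][]string)
	var stack []string // one entry per open <DL>; "" for the unnamed root
	pending := ""      // folder name waiting for its opening <DL>

	sc := bufio.NewScanner(strings.NewReader(doc))
	sc.Buffer(make([]byte, 0, 64*1024), 4*1024*1024) // allow long lines
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		upper := strings.ToUpper(line)
		switch {
		case strings.HasPrefix(upper, "<DL>"):
			stack = append(stack, pending)
			pending = ""
		case strings.HasPrefix(upper, "</DL>"):
			if len(stack) > 0 {
				stack = stack[:len(stack)-1]
			}
		default:
			if m := folderRe.FindStringSubmatch(line); m != nil {
				pending = m[1]
			} else if m := linkRe.FindStringSubmatch(line); m != nil {
				var path []string
				for _, name := range stack {
					if name != "" {
						path = append(path, name)
					}
				}
				result[m[1]] = path
			}
		}
	}
	return result
}

func main() {
	for url, path := range foldersByURL(sampleDoc) {
		fmt.Println(url, "->", strings.Join(path, ","))
	}
}
```

An importer could then turn only the innermost folder, or every folder in the chain, into tags. Which of the two Shiori does is not something this sketch tries to answer.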

Now let's have a look at a typical bookmark entry. It is composed of 2 parts, starting with the tags "DT" and "DD". These tags are not closed. Each bookmark has to have information for at least title, URL, tags, time-added and a "description" field. Example:

<DT><A HREF="https://wiki.ubuntu.com/Core" ADD_DATE="1314757883" PRIVATE="0" TAGS="ubuntu,appliances,tools,distros">Ubuntu Core - Ubuntu Wiki</A>
<DD>Ubuntu Core is a minimal rootfs for use in the creation of custom images for specific needs. Ubuntu Core strives to create a suitable minimal environment for use in Board Support Packages, constrained or integrated environments, or as the basis for application demonstration images.
Ubuntu Core delivers a functional user-space environment, with full support for installation of additional software from the Ubuntu repositories, through the use of the apt-get command.
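As an illustration of what can be read out of such an entry, here is a rough Go sketch that pulls the URL, title, tags and description from one DT/DD pair using only the standard library. It is not how Shiori parses the file. The `Entry` type and `parseEntry` function are hypothetical, and the regular expressions assume double-quoted attributes with no `>` inside attribute values, which is enough for the example above but not for every export.

```go
package main

import (
	"fmt"
	"html"
	"regexp"
	"strings"
)

// Entry holds the fields discussed in this issue (hypothetical type).
type Entry struct {
	URL         string
	Title       string
	Tags        []string
	Description string
}

var (
	anchorRe = regexp.MustCompile(`(?is)<DT><A\s+([^>]*)>(.*?)</A>`)
	attrRe   = regexp.MustCompile(`([A-Za-z_]+)="([^"]*)"`)
	descRe   = regexp.MustCompile(`(?is)<DD>(.*)`)
)

// sampleEntry is the entry from this issue with a shortened description.
const sampleEntry = `<DT><A HREF="https://wiki.ubuntu.com/Core" ADD_DATE="1314757883" PRIVATE="0" TAGS="ubuntu,appliances,tools,distros">Ubuntu Core - Ubuntu Wiki</A>
<DD>Ubuntu Core is a minimal rootfs for use in the creation of custom images for specific needs.`

// parseEntry reads one DT/DD pair. The second result is false when no
// <DT><A ...> anchor is found.
func parseEntry(raw string) (Entry, bool) {
	m := anchorRe.FindStringSubmatch(raw)
	if m == nil {
		return Entry{}, false
	}
	// Collect all attributes, keyed by upper-cased name.
	attrs := make(map[string]string)
	for _, a := range attrRe.FindAllStringSubmatch(m[1], -1) {
		attrs[strings.ToUpper(a[1])] = a[2]
	}
	e := Entry{
		URL:   html.UnescapeString(attrs["HREF"]),
		Title: html.UnescapeString(strings.TrimSpace(m[2])),
	}
	// TAGS is a comma-separated list; skip empty items.
	for _, t := range strings.Split(attrs["TAGS"], ",") {
		if t = strings.TrimSpace(t); t != "" {
			e.Tags = append(e.Tags, t)
		}
	}
	// Everything after <DD> is treated as the description.
	if d := descRe.FindStringSubmatch(raw); d != nil {
		e.Description = html.UnescapeString(strings.TrimSpace(d[1]))
	}
	return e, true
}

func main() {
	e, ok := parseEntry(sampleEntry)
	fmt.Println(ok, e.URL, e.Title, e.Tags)
	fmt.Println(e.Description)
}
```

Note that ADD_DATE and PRIVATE end up in the `attrs` map but are not copied into `Entry`, which mirrors the "what gets left out" question raised here.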

1. That DT line

That is where Shiori gets the bookmark URL, title and tags. It ignores (?) the ADD_DATE and PRIVATE attributes, which were (probably?) specific to delicious.com (or Pinboard or other bookmarking services). The ADD_DATE attribute is interesting, because it reminds me of the date when I first encountered and bookmarked something on the Internet.
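For reference, the ADD_DATE value in the example above looks like a Unix timestamp in seconds. Assuming that is the case, converting it to a date is short in Go. `parseAddDate` is a hypothetical helper name, not something taken from Shiori.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// parseAddDate converts an ADD_DATE attribute value to a UTC time,
// assuming the value is a count of seconds since the Unix epoch.
func parseAddDate(value string) (time.Time, error) {
	secs, err := strconv.ParseInt(value, 10, 64)
	if err != nil {
		return time.Time{}, fmt.Errorf("ADD_DATE %q is not an integer: %w", value, err)
	}
	return time.Unix(secs, 0).UTC(), nil
}

func main() {
	t, err := parseAddDate("1314757883")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(t.Format(time.RFC3339)) // 2011-08-31T02:31:23Z
}
```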

When Firefox exports your bookmarks to HTML, it also adds some extra attributes like ICON and ICON_URL. They too get ignored by Shiori when importing.

Question 1: Which tags and attributes from the DT line are "guaranteed" to get imported into Shiori? Which ones will Shiori guarantee to export? Would it be possible to import the ADD_DATE attribute?

2. That DD line

That is the more interesting one. It is the place for a description or note about the bookmark. On online bookmarking services (Pinboard, Delicious, etc.) that's where you could enter a couple of lines of text with notes to yourself or whatever you wanted to comment on and remember later. This is extremely useful, for example for research and academic purposes. Firefox, it seems, sometimes also uses the DD line and automatically populates it with a text extract from the main content.

Question 2: The DD line content is being imported into Shiori, right? I can see it when I list my bookmarks on the console. I just don't see it in the browser. Would it be possible to optionally display it in the browser as well?

This issue is possibly related to https://github.com/RadhiFadlillah/shiori/issues/121

ADDITIONAL: Experience importing a very large bookmarks file

(See also, https://github.com/techknowlogick/shiori/issues/144)

In my case the file was produced by delicious.com. I've been testing with different sizes, importing them and displaying them in Shiori, both on the console and in the browser. I want to get an idea of what Shiori, SQLite and Go can cope with, and what CPU/RAM/disk resources are needed for good performance. Because it's a big file, it may contain errors or illegal characters, or be damaged. I want to test the import process for data corruption issues.
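One way to check a file for such problems before importing is sketched below in Go. "Illegal characters" is taken here to mean bytes that are not valid UTF-8 plus control characters other than tab, newline and carriage return. That definition is an assumption for this sketch, not something Shiori specifies, and `suspectOffsets` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// suspectOffsets returns the byte offsets of invalid UTF-8 bytes and of
// control characters other than tab, newline and carriage return.
func suspectOffsets(data []byte) []int {
	var offsets []int
	for i := 0; i < len(data); {
		r, size := utf8.DecodeRune(data[i:])
		switch {
		case r == utf8.RuneError && size == 1:
			// A byte that does not start a valid UTF-8 sequence.
			offsets = append(offsets, i)
		case r < 0x20 && r != '\n' && r != '\r' && r != '\t':
			// An unexpected ASCII control character.
			offsets = append(offsets, i)
		}
		i += size
	}
	return offsets
}

func main() {
	data := []byte("ok\xffok\x00")
	fmt.Println(suspectOffsets(data)) // [2 5]
}
```

For a real file the bytes would come from something like `os.ReadFile("bookmarks.html")`, and the reported offsets can be inspected by hand before deciding whether to repair or drop the affected entries.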

To give an idea: if you bookmarked stuff on the internet for a couple of years and have 20000 well-commented bookmarks, you have an 8 MB HTML file with all your "work".

I have now tried with up to 10000 bookmarks, and I haven't updated the cache for all articles inside Shiori, only for individual ones or ones selected by searching some tags. I ended up with a shiori.db SQLite database of around 5 MB. Not bad.

On a 10-year-old Athlon desktop PC with an SSD and lots of RAM, Shiori seems to get the job done. That is, provided the data is clean and there are no illegal characters inside it. It takes a lot of time and the CPU usage spikes (not sure if that's an SQLite weakness?) but it ended well without any major glitches. Same for the console usage.

In the browser things are a little bit different. Although it displays everything correctly and I think it's a very elegant interface you designed, with 10000 bookmarks it's very sluggish. When I press the "Reload" button, it sometimes blocks and sometimes times out. The CPU usage of the shiori Go process spikes as well, but not the browser's CPU usage. Funny thing is, if I click on "Show Tags" and start picking tags, it doesn't feel sluggish. I need more testing to pinpoint the cause.

robotamer commented 2 years ago

Very interesting, hope you get some answers soon