jarun / buku

:bookmark: Personal mini-web in text
GNU General Public License v3.0

Mismatch between number of bookmarks in Firefox and number of bookmarks imported #679

Closed i2 closed 1 year ago

i2 commented 1 year ago

Hello!

I have 12182 unique bookmarks in Firefox. When I export them to an HTML file and run buku -i bookmarks.html, only 12131 of them are imported!

Now, I have two questions:

- Why this disparity?
- It would be helpful to show user what entries have been excluded.

Given the nature of this report, I didn't know how to provide details in a reproducible way, but please let me know if any further information is needed.
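For anyone trying to reproduce a count like this, a minimal sketch of how the links in an exported bookmarks file could be counted independently of buku is given below. It uses only the Python standard library; the names `LinkCollector` and `count_links` are hypothetical helpers, not part of buku or Firefox, and the sketch assumes the export stores each bookmark as an `<A HREF="...">` tag.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag found in a bookmarks export."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches too.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.urls.append(href)


def count_links(html_text):
    """Return (total, unique) counts of bookmark URLs in the export text."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return len(parser.urls), len(set(parser.urls))
```

Calling `count_links` on the contents of bookmarks.html gives two numbers to compare against what buku reports; a gap between total and unique would point to exact duplicate URLs in the export itself.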

jarun commented 1 year ago

Those are your bookmarks. Please check what is omitted. Probably duplicates.

LeXofLeviafan commented 1 year ago

> Why this disparity?

The likely reason is that buku doesn't allow duplicate URLs (i.e. multiple bookmarks of the same URL).

> It would be helpful to show user what entries have been excluded.

This can be done by running a Python script that uses buku as a library (pass your bookmarks file as CLI argument):

```python
#!/usr/bin/env python
import sys
from collections import Counter

from bs4 import BeautifulSoup
from buku import import_html

# Parse the exported bookmarks file given as the first CLI argument
with open(sys.argv[1]) as fin:
    html = BeautifulSoup(fin, 'html.parser')
bookmarks = list(import_html(html, add_parent_folder_as_tag=True, newtag="", use_nested_folder_structure=False))
count = Counter(x[0] for x in bookmarks)  # x[0] is the bookmark URL

for x in bookmarks:
    if count[x[0]] > 1:  # Note: this version shows *all* instances of a duplicate bookmark
        print(x)         #       (including the first one which isn't rejected)
print("Total bookmarks:", len(bookmarks))
print(f"(unique: {len(count)}, duplicates: {sum(n - 1 for n in count.values())})")
```

(Alternatively, with this version of buku.py you can use x.url instead of x[0], and get more sensible output as well)
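As the comments in the script note, it prints every instance of a duplicated URL, including the first one, which is kept. A small variation that lists only the later instances (the ones that would be rejected) might look like the sketch below. The helper name `later_duplicates` is hypothetical, and it assumes each record carries its URL at index 0, as in the script above.

```python
def later_duplicates(bookmarks):
    """Yield every record whose URL (element 0) already appeared earlier in the list."""
    seen = set()
    for record in bookmarks:
        url = record[0]
        if url in seen:
            # A record with this URL was already seen, so this one is a repeat.
            yield record
        else:
            seen.add(url)


# Made-up sample records, for illustration only
sample = [
    ("https://example.com/a", "First"),
    ("https://example.com/b", "Other"),
    ("https://example.com/a", "Second copy"),
]
for record in later_duplicates(sample):
    print(record)
```

In the script, the same function could be applied to the `bookmarks` list in place of the `Counter`-based loop.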

i2 commented 1 year ago

@LeXofLeviafan, thanks for your comprehensive response and the code snippet. The reason I opened this issue was that I suspected there might be a bug in buku that needs a fix (and that other people might run into as well).

As I mentioned above, my bookmarks are de-duplicated (I used both the "Bookmarks Dupes" and "Bookmark clean up" extensions for this). The first thing that came to mind to explain the difference was bookmarks that start with javascript:, but I could see in buku that at least some javascript: URIs were imported just fine.

I looked a bit closer (and used your code snippet) and I confirm the bookmarks that cause this disparity are all: javascript URLs that have matching title but have slightly different URL. And buku considers them duplicate.

Also, have you thought about asking for a MR based on api-fix branch to the main project?

Thanks!

LeXofLeviafan commented 1 year ago

> the bookmarks that cause this disparity are all: javascript URLs that have matching title but have slightly different URL. And buku considers them duplicate.

This… doesn't seem like something that should happen. As far as I can tell from the code, only duplicate and empty URLs are rejected by buku.

Maybe you can check what's really happening there by running buku with the -g/--debug flag? It should enable printing of warnings and errors.

> Also, have you thought about asking for a MR based on api-fix branch to the main project?

I've discussed the idea with the project owner before… Problem is, he believes that not changing library API is better than improving it (…and he also believes that it should imitate CLI as closely as possible, no matter how badly it affects actual usability).

jarun commented 1 year ago

@LeXofLeviafan I see you have omitted the point that I am worried about breaking downstream. It's OK you find pleasure in finding faults with others and advertising those, but at least keep the record straight.

@i2 what's your business with that branch? Are you writing a tool important enough to make changes?

LeXofLeviafan commented 1 year ago

You're making it sound like I misrepresented your stance somehow. I'm pretty sure it was perfectly accurate tho. (And motivations for avoiding changes in an API are the same for any project… doesn't mean such changes are never made because of them.)

jarun commented 1 year ago

You just represented your own perspective of it.

i2 commented 1 year ago

@jarun First of all, thank you for creating this project.

To enrich my own projects, I like to think inclusivity benefits everyone (The above situation was just an example). That code snippet and said branch helped me pin down what the issue was, so for me it was of value (I spent hours on this, as I had about 50k bookmarks!), thank you @LeXofLeviafan!

My issue is completely resolved. The culprit was bookmarks whose URLs contained percent-encoding (URL encoding), such as %20 for a space or %2F for a slash.
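The thread does not say exactly how percent-encoding led to the rejected entries, so the following is only a diagnostic sketch, not a description of buku's behaviour. It groups URLs by their percent-decoded form using the standard library's `urllib.parse.unquote`, which makes it easy to spot bookmarks that differ only in encoding. The helper name `group_by_decoded` is hypothetical.

```python
from collections import defaultdict
from urllib.parse import unquote


def group_by_decoded(urls):
    """Map each percent-decoded URL to the distinct raw URLs that decode to it,
    keeping only groups with more than one raw spelling."""
    groups = defaultdict(list)
    for url in urls:
        decoded = unquote(url)
        if url not in groups[decoded]:
            groups[decoded].append(url)
    return {decoded: raw for decoded, raw in groups.items() if len(raw) > 1}


# Made-up sample URLs, for illustration only
sample = [
    "https://example.com/a%2Fb",
    "https://example.com/a/b",
    "https://example.com/c",
]
for decoded, raw_urls in group_by_decoded(sample).items():
    print(decoded, "<-", raw_urls)
```

Feeding it the URLs from an export would show which bookmarks look unique as raw strings but coincide once decoded.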