Closed by i2 1 year ago
Those are your bookmarks. Please check what is omitted. Probably duplicates.
Why this disparity?
The likely reason is that buku doesn't allow for duplicate URLs (i.e. multiple bookmarks of the same URL)
It would be helpful to show user what entries have been excluded.
This can be done by running a Python script that uses buku as a library (pass your bookmarks file as a CLI argument):

    #!/usr/bin/env python
    import sys
    from collections import Counter
    from bs4 import BeautifulSoup
    from buku import import_html

    with open(sys.argv[1]) as fin:
        html = BeautifulSoup(fin, 'html.parser')
    bookmarks = list(import_html(html, add_parent_folder_as_tag=True, newtag="", use_nested_folder_structure=False))
    count = Counter(x[0] for x in bookmarks)
    for x in bookmarks:
        if count[x[0]] > 1:  # Note: this version shows *all* instances of a duplicate bookmark
            print(x)         # (including the first one which isn't rejected)
    print("Total bookmarks:", len(bookmarks))
    print(f"(unique: {len(count)}, duplicates: {sum(n-1 for x, n in count.items())})")
(Alternatively, with this version of buku.py you can use x.url instead of x[0], and get more sensible output as well)
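If you only want the entries that would actually be dropped (the second and later occurrence of each URL, not the first), something along these lines might work. This is a rough sketch, not part of buku: rejected_duplicates is a hypothetical helper, and it assumes each entry is a tuple with the URL as its first field, as in the script above.

```python
def rejected_duplicates(bookmarks):
    """Return entries whose URL (first field) already appeared earlier in the list."""
    seen = set()
    rejected = []
    for entry in bookmarks:
        url = entry[0]
        if url in seen:
            # A later occurrence of a URL we have already seen
            rejected.append(entry)
        else:
            seen.add(url)
    return rejected


if __name__ == "__main__":
    # Made-up sample data, only to illustrate the shape of the input
    sample = [
        ("https://example.com/a", "A"),
        ("https://example.com/b", "B"),
        ("https://example.com/a", "A again"),
    ]
    for entry in rejected_duplicates(sample):
        print(entry)
```

In the script above, you could pass the bookmarks list to this helper in place of the Counter-based loop.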
@LeXofLeviafan, thanks for your comprehensive response and the code snippet. The reason I opened this issue was I suspected there might be a bug that needs a fix in buku (and might be something that other people would face too).
As I mentioned above, my bookmarks are de-duplicated (I used both the "Bookmarks Dupes" and "Bookmark clean up" extensions for this). The first thing that came to mind for this difference in numbers was bookmarks that start with javascript:, but I could see in my buku that at least some javascript URIs were imported just fine.
I looked a bit closer (and used your code snippet) and I confirm the bookmarks that cause this disparity are all: javascript URLs that have matching title but have slightly different URL. And buku considers them duplicate.
Also, have you thought about asking for a MR based on api-fix branch to the main project?
Thanks!
the bookmarks that cause this disparity are all: javascript URLs that have matching title but have slightly different URL. And buku considers them duplicate.
This… doesn't seem like something that should happen. As far as I can tell from the code, only duplicate and empty URLs are rejected by buku.
Maybe you can check what's really happening there by running buku with the -g/--debug flag? It should enable printing out warnings and errors.
Also, have you thought about asking for a MR based on api-fix branch to the main project?
I've discussed the idea with the project owner before… Problem is, he believes that not changing library API is better than improving it (…and he also believes that it should imitate CLI as closely as possible, no matter how badly it affects actual usability).
@LeXofLeviafan I see you have omitted the point that I am worried about breaking downstream. It's OK you find pleasure in finding faults with others and advertising those, but at least keep the record straight.
@i2 what's your business with that branch? Are you writing a tool important enough to make changes?
You're making it sound like I misrepresented your stance somehow. I'm pretty sure it was perfectly accurate tho. (And motivations for avoiding changes in an API are the same for any project… doesn't mean such changes are never made because of them.)
You just represented your own perspective of it.
@jarun First of all, thank you for creating this project.
To enrich my own projects, I like to think inclusivity benefits everyone (The above situation was just an example). That code snippet and said branch helped me pin down what the issue was, so for me it was of value (I spent hours on this, as I had about 50k bookmarks!), thank you @LeXofLeviafan!
My issue is completely resolved. The culprit was bookmarks whose URLs contained percent-encoding (URL encoding), such as %20 for a space or %2F for a slash.
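For anyone hitting something similar, a quick way to spot such bookmarks might be to group URLs by their percent-decoded form and list the groups that contain more than one distinct spelling. This is only a diagnostic sketch: the thread does not establish how buku itself compares URLs, so treating decoded-equal URLs as colliding is an assumption, and group_by_decoded_url is a hypothetical helper name.

```python
from collections import defaultdict
from urllib.parse import unquote


def group_by_decoded_url(urls):
    """Map each decoded URL to its original spellings, keeping only groups
    that contain at least two different original spellings."""
    groups = defaultdict(list)
    for url in urls:
        groups[unquote(url)].append(url)
    return {
        decoded: originals
        for decoded, originals in groups.items()
        if len(set(originals)) > 1
    }


if __name__ == "__main__":
    # Made-up sample URLs, only for illustration
    sample = [
        "https://example.com/a%20b",
        "https://example.com/a b",
        "https://example.com/c",
    ]
    for decoded, originals in group_by_decoded_url(sample).items():
        print(decoded, "<-", originals)
```

The URLs could come from the first field of each entry returned by import_html in the script earlier in this thread.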
Hello!
I have 12182 unique bookmarks in Firefox. I export them to an HTML file, and when I run buku -i bookmarks.html, it only imports 12131 of them! Now, I have two questions:
Given the nature of this report, I didn't know how to provide details in a reproducible way, but please let me know if any further information is needed.