stephen-hill / musikcube

Automatically exported from code.google.com/p/musikcube

Indexer should recognize moved files. #2

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
The indexer should recognize moved files. This can be done in several ways. The best way may be to look for files with the same artist, album, and title and check whether any of them are unavailable.
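
The tag-based lookup described above could be sketched roughly like this. This is a minimal illustration assuming a hypothetical SQLite `tracks` table with an `available` flag that the indexer clears when a file disappears from disk; musikcube's actual schema will differ:

```python
import sqlite3

def find_moved_candidates(conn, artist, album, title):
    """Return (id, filename) of tracks whose tags match the newly found
    file and whose old path is no longer available on disk.

    Assumes a hypothetical `tracks` table; not musikcube's real schema.
    """
    cur = conn.execute(
        "SELECT id, filename FROM tracks "
        "WHERE artist = ? AND album = ? AND title = ? AND available = 0",
        (artist, album, title),
    )
    return cur.fetchall()
```

Any row returned here is a candidate for being the old location of the newly found file.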

Original issue reported on code.google.com by onne...@gmail.com on 18 Mar 2008 at 8:51

GoogleCodeExporter commented 9 years ago

Original comment by bjorn.ol...@gmail.com on 20 Mar 2008 at 5:21

GoogleCodeExporter commented 9 years ago
The safest way would be to checksum the audio part of the audio file and compare it to the removed file's checksum. That way the tags do not matter. However, checksumming the audio file requires the tagger to read the whole file, and that may be very slow.

Original comment by onne...@gmail.com on 27 Mar 2008 at 12:21

GoogleCodeExporter commented 9 years ago
Calculating a checksum is slow but would probably be the best way to find duplicates/moved files. Perhaps the checksum should be stored as a tag in the file as well as in the DB.

If no checksum is found, using MusicBrainz IDs or something similar could be useful.

Original comment by bjorn.ol...@gmail.com on 28 Mar 2008 at 6:59

GoogleCodeExporter commented 9 years ago
I guess I could start by comparing the safe stuff:
  * duration
  * bitrate
  * sampleRate
  * channels
and then the slightly less safe stuff:
  * title
  * artist
  * album

The first group must match exactly, and the second group should have the title and either the artist or the album right. If the above holds (and assuming the found file is removed) the file should be considered moved.

If there are multiple matches, I need to start comparing more tags in a prioritized list until there is only one left.
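
The passes described above could be sketched as follows. Field names are illustrative, not musikcube's actual schema, and the priority list used for tie-breaking in the final pass is an assumption:

```python
def match_moved_file(new_file, missing_files):
    """Pick the missing DB entry that a newly found file most likely
    replaces. Entries are dicts of format/tag data (illustrative names)."""
    # Pass 1: the format data must match exactly.
    exact = ("duration", "bitrate", "sampleRate", "channels")
    candidates = [f for f in missing_files
                  if all(f[k] == new_file[k] for k in exact)]

    # Pass 2: the title must match, plus either the artist or the album.
    candidates = [f for f in candidates
                  if f["title"] == new_file["title"]
                  and (f["artist"] == new_file["artist"]
                       or f["album"] == new_file["album"])]

    # Pass 3: on multiple hits, compare further tags in priority order
    # (hypothetical priority list) until one candidate remains.
    for key in ("artist", "album", "track", "genre", "year"):
        if len(candidates) <= 1:
            break
        narrowed = [f for f in candidates if f.get(key) == new_file.get(key)]
        if narrowed:
            candidates = narrowed

    # Only an unambiguous single match is treated as a moved file.
    return candidates[0] if len(candidates) == 1 else None
```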

Original comment by onne...@gmail.com on 28 Mar 2008 at 11:26

GoogleCodeExporter commented 9 years ago

Original comment by onne...@gmail.com on 21 Apr 2008 at 6:51

GoogleCodeExporter commented 9 years ago
I am no coder (as you know), but recognizing "old" files would be nice so you don't lose your song statistics. Why not compute a checksum when adding a file to the DB (or changing its tags) and store it there? If that file is moved, mC2 would add the newly found files, compute their checksums, compare them with the old ones, and then delete or adjust the missing/moved entries.

Original comment by HomiSite on 22 Apr 2008 at 4:14

GoogleCodeExporter commented 9 years ago
Hey All,

DocTriv pointed me at this issue. I recently started a thread about this on the mC forums requesting this feature.

I'm sure you guys realize this, but adding a checksum to the DB (and possibly a tag) would be a much larger benefit than just allowing the indexer to recognize moved files. It would also allow merging databases, which opens up all sorts of flexibility, namely collecting the metadata from several separate instances of mC into a single repository. There would be a slight performance hit, but my very unscientific testing shows that with CRC32 checksums about 67 MP3s/sec can be processed on a relatively fast box.

Keep up the good work guys.

Thx,
-Mid

Original comment by midnigh...@gmail.com on 22 May 2008 at 6:42

GoogleCodeExporter commented 9 years ago
67 MP3s/sec sounds quick for checksumming. My average MP3 file size is 5 MB, which would mean an I/O speed of 335 MB/sec. This leads me to believe that it must be a partial checksum of some sort. Using a partial checksum is not a bad idea, though. Maybe do a checksum on the first 1 KB of the audio part of the file.

I think many people put their files on a network share of some sort, so I/O is a critical issue.
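
Checksumming the first 1 KB of the audio part could look roughly like this. This is a sketch that only handles a leading ID3v2 tag (whose 10-byte header stores the tag size as a 4-byte syncsafe integer); ID3v1 trailers, APE tags, and other containers are ignored:

```python
import zlib

def partial_audio_crc(path, nbytes=1024):
    """CRC32 of the first `nbytes` of the audio data, skipping any
    leading ID3v2 tag so that tag edits don't change the checksum."""
    with open(path, "rb") as f:
        header = f.read(10)
        if header[:3] == b"ID3" and len(header) == 10:
            # ID3v2 tag size: 4 syncsafe bytes, 7 significant bits each.
            size = ((header[6] & 0x7F) << 21 | (header[7] & 0x7F) << 14 |
                    (header[8] & 0x7F) << 7 | (header[9] & 0x7F))
            f.seek(10 + size)
        else:
            f.seek(0)
        return zlib.crc32(f.read(nbytes)) & 0xFFFFFFFF
```

With this, two copies of the same audio carrying different ID3v2 tags produce the same partial checksum.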

Original comment by onne...@gmail.com on 22 May 2008 at 7:51

GoogleCodeExporter commented 9 years ago
Good catch. I was using a duplicate-file checking program (DoubleKiller) to generate all the CRC32 checksums for comparison. I ran it again, watched I/O performance, and paid closer attention to what it was doing. It must do exactly as you mentioned and checksum only a very small portion of each song, as there was almost no I/O for the first pass; then it appears to go back and do full checksums on the collisions, which pegged disk I/O. The full checksums still appeared to zip by quickly, but I've no way to judge performance with such a crude test.

Sorry for misleading everyone. Doing partial checksums of a specific segment or two of each song would increase performance dramatically; I'm not sure how big the segment would need to be to reduce collisions to an acceptable rate (hopefully near 0).

-Mid
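
The two-pass behaviour described above (cheap partial checksums first, full checksums only where the partial ones collide) could be sketched like this. Illustrative only; a real implementation would stream large files rather than read them whole:

```python
import zlib
from collections import defaultdict

def crc_of(path, limit=None):
    """CRC32 of the whole file, or of its first `limit` bytes."""
    with open(path, "rb") as f:
        data = f.read(limit) if limit is not None else f.read()
    return zlib.crc32(data) & 0xFFFFFFFF

def find_duplicates(paths, probe_bytes=1024):
    """Return groups of paths with identical full-file checksums,
    using a cheap partial-checksum pass to limit the expensive I/O."""
    # Pass 1: cheap checksum of just the first probe_bytes (little I/O).
    partial = defaultdict(list)
    for p in paths:
        partial[crc_of(p, probe_bytes)].append(p)

    # Pass 2: full checksums only where the partial checksums collide.
    dupes = []
    for group in partial.values():
        if len(group) < 2:
            continue
        full = defaultdict(list)
        for p in group:
            full[crc_of(p)].append(p)
        dupes.extend(g for g in full.values() if len(g) > 1)
    return dupes
```

Files whose partial checksums are unique never need a full read, which is where the speedup comes from.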

Original comment by midnigh...@gmail.com on 22 May 2008 at 8:26

GoogleCodeExporter commented 9 years ago
I think the partial checksum is a good idea. Although, thinking about it, maybe take the checksum not at the very beginning. Many songs tend to be quiet in the beginning, so the checksums could (though not likely) be the same.

To summarize, comparing a file for being moved (or anything else) should consider the following:
  * duration
  * bitrate
  * sampleRate
  * channels
  * partial checksum
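
Combining the fields listed above into a single comparison key, with the partial checksum taken from the middle of the audio data rather than the (possibly quiet) start, might look like this. Field names and the probe position are assumptions:

```python
import zlib

def fingerprint(duration, bitrate, sample_rate, channels, audio_bytes,
                probe_size=1024):
    """Build a comparison key from the format fields plus a partial
    checksum of `probe_size` bytes taken from the middle of the audio."""
    mid = max(0, len(audio_bytes) // 2 - probe_size // 2)
    crc = zlib.crc32(audio_bytes[mid:mid + probe_size]) & 0xFFFFFFFF
    return (duration, bitrate, sample_rate, channels, crc)
```

Two files would then be treated as the same track only when every element of the tuple matches.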

Original comment by onne...@gmail.com on 22 May 2008 at 8:41

GoogleCodeExporter commented 9 years ago
This is probably obvious, but I just thought of it, so I thought I'd put it out there.

To cut down on collisions caused by using a small-segment checksum (if there are any), you could also use other static information about the MP3. E.g., if there is a checksum collision but the song length is different, it must be a different song. It wouldn't be perfect, but it'd be a quick, painless way to double-check.

-Mid

Original comment by midnigh...@gmail.com on 22 May 2008 at 8:45

GoogleCodeExporter commented 9 years ago
I think CRC32 does do a full checksum, and the reason it is quick is that it doesn't analyse the whole file. It picks out the blocks that it needs.

Checking a file against its CRC hash is the only reliable way to check whether a file has moved. MP3 tags can be changed outside of MusikCube, and this would result in a different hash.

This would be good for merging databases :)

Original comment by gatekil...@gmail.com on 23 May 2008 at 10:36