harshankur / officeParser

A Node.js library to parse text out of any office file. Currently supports docx, pptx, xlsx and odt, odp, ods..
MIT License
123 stars 17 forks source link

Sort PowerPoint slides by slide numbers #38

Closed purfectliterature closed 1 week ago

purfectliterature commented 1 month ago

Currently, slides order is not guaranteed by decompress. In fact, after decompression, I found my slides order jumbled. This PR adds a sorting step before reading the XML files. It sorts the files by slide number, and if a slide n has notes, the notes for slide n will appear after slide n. putNotesAtLast can still override this order.

purfectliterature commented 1 week ago

@harshankur Good point, but technically this will never happen, right? The filter on line 135 ensures that all the files that made it through will have the slide numbers on them, as specified by the regex at lines 127 and 128.

harshankur commented 1 week ago

@harshankur Good point, but technically this will never happen, right? The filter on line 135 ensures that all the files that made it through will have the slide numbers on them, as specified by the regex at lines 127 and 128.

@purfectliterature Yes, I agree with you 100% that it will never happen with the current code but this is code smell which later MIGHT break things if the regex is used to do more things than what it is doing at the moment. I am planning bigger version updates and these kinds of things might result in an unwanted bug later on if overlooked. And keeping it in the current form does not give us any benefit either. It's better to guard those areas and keep code without such possible scenarios. :)

purfectliterature commented 1 week ago

@harshankur Okay, fair. If aNumber - bNumber returns NaN, then the || will allow Number(a.path.includes('notes')) - Number(b.path.includes('notes')) to return, and the latter will always return not NaN, i.e., correct numbers, because String.prototype.includes always returns a boolean. I don't think there's a way for this .sort to combust and throw an error due to NaNs. 😄

Anyway, what do you suggest should happen in the case the regex match fails? I can instead the matched string to 0 if you just don't like NaNs, but that's basically saying that invalid files have the first priorities (slide number zero). Not sure if that's behaviourally correct.

harshankur commented 1 week ago

@purfectliterature I don't have a problem with NaNs, I have a problem with NaN - NaN. And we can probably avoid that scenario by assigning a bigger number like Infinity as a fallback value in case something is NaN, thereby penalizing them for not being found in the regex.

I had something like this on my mind to add on to your solution.

decompress(filepath,
    decompressLocation,
    { filter: x => x.path.match(config.ignoreNotes ? slidesRegex : allFilesRegex) }
)
.then(files => {
    // Sort files by slide number and their notes (if any).
    files.sort((a, b) => {
        const aNumber = parseInt(a.path.match(slideNumberRegex)?.at(1), 10);
        const bNumber = parseInt(b.path.match(slideNumberRegex)?.at(1), 10);

        const validANumber = isNaN(aNumber) ? Infinity : aNumber;
        const validBNumber = isNaN(bNumber) ? Infinity : bNumber;

        return validANumber - validBNumber || Number(a.path.includes('notes')) - Number(b.path.includes('notes'));
    });

The other way is you can get items and their matched slide number in an array and do your sorting only if all the items have a valid slide number matched with the regex. I have not run this so far but I think this should work as well.

decompress(filepath,
    decompressLocation,
    { filter: x => x.path.match(config.ignoreNotes ? slidesRegex : allFilesRegex) }
)
.then(files => {
    const filesWithSlideNumbers = files.map(file => ({ file, slideNumber: parseInt(file.path.match(slideNumberRegex)?.at(1), 10) }));
    if (filesWithSlideNumbers.every(item => !isNaN(item.slideNumber)))
        filesWithSlideNumbers.sort((a, b) => a.slideNumber - b.slideNumber || Number(a.file.path.includes('notes')) - Number(b.file.path.includes('notes')));

    files = filesWithSlideNumbers.map(file => file.file);

I don't have a preference between either of the two solutions or if you have any better solution in mind. What do you think?

purfectliterature commented 1 week ago

@harshankur I think the first one will be more performant (theoretically) than the second one, without the additional O(n) iterations. Plus, the first one is easier to read and understand.

Thanks for your suggestions. I've made the requested changes and it should be good to go now!