opengovsg / pdf2md

A PDF to Markdown converter
https://www.npmjs.com/package/@opendocsg/pdf2md
MIT License
210 stars, 40 forks

Support to respect the layout of the original pdf file #68

Open Nan-Do opened 1 year ago

Nan-Do commented 1 year ago

I'm trying to convert some files that contain Python code, but the tool doesn't respect the original formatting and outputs the text without the original line breaks and indentation. For example, a PDF containing the following text:

# Time: O(n)
# Space: O(n)

# freq table

Next there is the solution to the proposed problem using Python2:

class Solution(object):
    def isGood(self, nums):
        """
        :type nums: List[int]
        :rtype: bool
        """
        cnt = [0]*len(nums)
        for x in nums:
            if x < len(cnt):
                cnt[x] += 1
            else:
                return False
        return all(cnt[x] == 1 for x in xrange(1, len(nums)-1))

is converted into:

# Time: O(n) # Space: O(n)

# freq table

Next there is the solution to the proposed problem using Python2:

class Solution(object): def isGood(self, nums): """ :type nums: List[int] :rtype: bool """ cnt = [0]*len(nums) for x in nums: if x < len(cnt): cnt[x] += 1 else: return False return all(cnt[x] == 1 for x in xrange(1, len(nums)-1))

In this case, the tool doesn't detect the text as a code block. In some other examples, it detects the code blocks correctly but still removes the leading indentation. One such example is this book.

Is there a way to force the tool to respect the original formatting?

LoneRifle commented 1 year ago

Could you try 0.1.25 and verify if the problem is present there too?

Nan-Do commented 1 year ago

Sure! I just tried version 0.1.25, and the output is exactly the same with regard to the Python formatting issue. The same behavior of ignoring the leading indentation also occurs with pdf-to-markdown.
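
To make the request concrete, here is a minimal sketch of the behavior I have in mind. It is not pdf2md's code or data model: the `(x, y, text)` tuples, the `rebuild_lines` name, and the fixed `char_width` are all assumptions for illustration. The idea is to group positioned text fragments by baseline and turn each row's horizontal offset from the left margin into leading spaces.

```python
from collections import defaultdict


def rebuild_lines(items, char_width, left_margin=None):
    """Rebuild text lines, with indentation, from positioned fragments.

    items: (x, y, text) tuples; a hypothetical shape, not pdf2md's model.
    char_width: assumed width of one monospace character, in the units of x.
    left_margin: x position of column 0; defaults to the smallest x seen.
    """
    items = list(items)
    if not items:
        return []
    if left_margin is None:
        left_margin = min(x for x, _, _ in items)

    # Group fragments that share (approximately) the same baseline.
    rows = defaultdict(list)
    for x, y, text in items:
        rows[round(y)].append((x, text))

    lines = []
    # PDF coordinates grow upward, so the largest y is the first line.
    for y in sorted(rows, reverse=True):
        row = sorted(rows[y])
        # The first fragment's offset from the margin gives the indent.
        indent = round((row[0][0] - left_margin) / char_width)
        lines.append(" " * indent + " ".join(text for _, text in row))
    return lines


# Made-up coordinates: 6 units per character, 12 units per line.
sample = [
    (72.0, 700.0, "class Solution(object):"),
    (96.0, 688.0, "def isGood(self, nums):"),
    (120.0, 676.0, "cnt = [0]*len(nums)"),
]
print("\n".join(rebuild_lines(sample, char_width=6.0)))
```

A real implementation would presumably have to estimate the character width per font and cope with proportional fonts, so this only shows the kind of output I'm hoping for.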