rsaim / supplementary

Portal to analyze and visualize results of DTU students.

Determine uniform output from the parsing of pdf files #3

Open rsaim opened 4 years ago

rsaim commented 4 years ago

The current parsing logic only extracts the grades. We should also extract metadata such as semester number, branch name, credits, SPI, etc. I have done preliminary work to extract this information and produce the uniform JSON detailed below.

        results = [
            {
                "name"                : "<student full name>",
                "rollno"              : "<rollno>",
                "program"             : "<btech, mtech, etc>",
                "branch"              : "<branchname>",
                "semester"            : "<semester number>",
                "pdf_filename"        : "<filename>",
                "pdf_pagenum"         : "<pagenum>",
                "release_date"        : "<date>",
                "examination_date"    : "<date>",
                "notice"              : "<notice>",
                "SPI"                 : "<spi>",
                "total_credits"       : "<total_credits>",
                "papers_failed"       : ["sub1_code", "sub2_code", ...],
                "marks"               : {
                    "<subject1_code>"    : "<marks>",
                    "<subject2_code>"    : "<marks>",
                    ...
                }
            },
            # ...
        ]
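
To check that every parsed record actually follows this layout, a small validator could be run over the parser output. The following is only a rough sketch: the function name `validate_record` and the `STRING_KEYS` tuple are hypothetical, and the rule that every failed paper must also appear in `marks` is an assumption, not something the PDFs are known to guarantee.

```python
from typing import Any, List

# Keys taken from the schema above. Every value is expected to be a string,
# except "papers_failed" (list of subject codes) and "marks" (code -> marks).
STRING_KEYS = (
    "name", "rollno", "program", "branch", "semester",
    "pdf_filename", "pdf_pagenum", "release_date",
    "examination_date", "notice", "SPI", "total_credits",
)


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one result record (empty if none)."""
    problems: List[str] = []

    for key in STRING_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], str):
            problems.append(f"{key} should be a string")

    failed: Any = record.get("papers_failed")
    failed_ok = isinstance(failed, list) and all(isinstance(c, str) for c in failed)
    if not failed_ok:
        problems.append("papers_failed should be a list of subject codes")

    marks: Any = record.get("marks")
    if not isinstance(marks, dict):
        problems.append("marks should map subject codes to marks")
    elif failed_ok:
        # Assumption (not confirmed by the source): a failed paper is also
        # listed in the marks mapping.
        for code in failed:
            if code not in marks:
                problems.append(f"failed paper {code} has no entry in marks")

    return problems
```

Running `validate_record` on each entry of `results` and collecting the non-empty lists would give a quick per-file view of which fields the parser missed.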

I am currently using pdfminer and tabula to extract the data. Some related work was done in commit 2366fc1ebac70b31cc79e0dc9bc1829dab46fe05.

FYI - @tezas @himanshuhy

rsaim commented 4 years ago

I tried improving the logic to extract all the information described in this issue. However, I was able to correctly parse only 63 of the 1321 files I downloaded.

@tezas @himanshuhy - We need to discuss this further, as it's getting a bit hectic for me :(