s39674 / Image2schematic

Trying to extract pcb schematics from images using computer vision

Extracting pin information from datasheets #20

Closed s39674 closed 2 years ago

s39674 commented 2 years ago

Currently, in the branch Dev_fetchViaHTTP_fix#15, we query www.octopart.com to get the chip's datasheet. To obtain the pin information, we need to analyze the PDF. Datasheets that have a dedicated pin table, like this one:

https://datasheet.octopart.com/TC1427CPA-Microchip-datasheet-22875.pdf

drawing

are quite easy to process using PyPDF2:

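import io
import requests
from PyPDF2 import PdfReader

# FirstBestDataSheetUrl is the datasheet URL obtained from the octopart query
response = requests.get(FirstBestDataSheetUrl)
reader2 = PdfReader(io.BytesIO(response.content))
for page in reader2.pages:
    text = page.extract_text()  # extract once per page and reuse
    if "PIN FUNCTION TABLE" in text:
        print(text)
# prints: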
1 NC No connection.
2 IN A Control input A, TTL/CMOS compatible logic input.
3 GND Ground.
4 IN B Control input B, TTL/CMOS compatible logic input.
5 OUT B Output B, CMOS totem-pole output.
6V
DDSupply input, 4.5V to 16V.
7 OUT A Output A, CMOS totem-pole output.
8 NC No connection.
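
A minimal sketch of how the extracted text above could be grouped into one entry per pin. The names `PIN_LINE` and `parse_pin_table` are hypothetical and not part of the repository, and the approach assumes pins are listed in order starting from 1. It does not try to split the pin name from its description, and it does not repair the subscript that PyPDF2 splits across lines (the `6V` / `DDSupply` rows); it only attaches the stray line to pin 6.

```python
import re

# A row starts with a pin number followed by text that begins with a letter.
PIN_LINE = re.compile(r"^(\d{1,3})\s*([A-Za-z].*)$")


def parse_pin_table(text):
    """Group extracted table text into {pin_number: raw_row_text}."""
    pins = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = PIN_LINE.match(line)
        # Accept a new pin only if it continues the 1, 2, 3... sequence,
        # so values such as "16V" inside a description are not mistaken for pins.
        if match and int(match.group(1)) == len(pins) + 1:
            current = int(match.group(1))
            pins[current] = match.group(2)
        elif current is not None:
            # Continuation of the previous row (e.g. a split subscript).
            pins[current] += " " + line
    return pins


sample = """1 NC No connection.
2 IN A Control input A, TTL/CMOS compatible logic input.
3 GND Ground.
4 IN B Control input B, TTL/CMOS compatible logic input.
5 OUT B Output B, CMOS totem-pole output.
6V
DDSupply input, 4.5V to 16V.
7 OUT A Output A, CMOS totem-pole output.
8 NC No connection.
"""

pins = parse_pin_table(sample)
for number, row in pins.items():
    print(number, row)
```

The number of keys in `pins` gives the pin count, which is probably the most reliable piece of information this kind of parsing can offer.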

The problem arises when we encounter PDFs that don't have a pin function table and instead show the pins as a diagram, like this: image https://www.digchip.com/datasheets/download_datasheet.php?id=467924&part-number=MC74HC589A&type=pn2

Any ideas on how to deal with that?

ObaidAshraf commented 2 years ago

Just an FYI, I made some progress on this problem. I was able to extract pin information from an image using OCR. Below are the results, one detection per row as [x, y, (text, confidence)]: pins2

[116.0, 172.0, ('VDD', 0.9834039211273193)]
[292.0, 172.0, ('1', 0.9982110261917114)]
[748.0, 170.0, ('14', 0.9994282722473145)]
[900.0, 174.0, ('Vss', 0.961628258228302)]
[126.0, 306.0, ('GPO', 0.9020509123802185)]
[290.0, 304.0, ('2', 0.9986997842788696)]
[748.0, 306.0, ('13', 0.9996862411499023)]
[888.0, 308.0, ('D+', 0.9965775609016418)]
[126.0, 438.0, ('GP1', 0.9987953305244446)]
[292.0, 440.0, ('3', 0.9960875511169434)]
[748.0, 440.0, ('12', 0.9991488456726074)]
[878.0, 440.0, ('D-', 0.9935600757598877)]
[126.0, 578.0, ('RST', 0.996580183506012)]
[292.0, 576.0, ('4', 0.9995313882827759)]
[748.0, 574.0, ('11', 0.9991548657417297)]
[918.0, 580.0, ('VUSB', 0.9856255054473877)]
[132.0, 714.0, ('URx', 0.8305797576904297)]
[288.0, 710.0, ('5', 0.9993932247161865)]
[746.0, 710.0, ('10', 0.9993680119514465)]
[910.0, 710.0, ('SCL', 0.989795982837677)]
[122.0, 850.0, ('UTX', 0.8617784976959229)]
[292.0, 846.0, ('6', 0.960120439529419)]
[748.0, 844.0, ('9', 0.9891974329948425)]
[914.0, 850.0, ('SDA', 0.9971482157707214)]
[140.0, 982.0, ('GP2', 0.9968588948249817)]
[278.0, 980.0, ('7', 0.930237352848053)]
[746.0, 978.0, ('8', 0.9805749654769897)]
[914.0, 986.0, ('GP3', 0.9990615844726562)]

pins

[620.0, 46.0, ('PIN ASSIGNMENT', 0.974632203578949)]
[458.0, 180.0, ('B1.', 0.8615524172782898)]
[814.0, 183.0, ('16  Vcc', 0.9195017218589783)]
[436.0, 284.0, ('c2', 0.8155210018157959)]
[786.0, 284.0, ('15A', 0.9963900446891785)]
[436.0, 380.0, ('D3', 0.9665333032608032)]
[797.0, 386.0, ('14SA', 0.9280725717544556)]
[1018.0, 452.0, ('SERIAL SHIFT/', 0.9841246008872986)]
[438.0, 484.0, ('E4', 0.9762129783630371)]
[740.0, 486.0, ('13', 0.999295711517334)]
[1044.0, 518.0, ('PARALLELLOAD', 0.9968366622924805)]
[438.0, 580.0, ('F5', 0.9857825040817261)]
[952.0, 582.0, ('12  LATCH CLOCK', 0.9837801456451416)]
[436.0, 686.0, ('G6', 0.9747713804244995)]
[944.0, 686.0, ('11  SHIFT CLOCK', 0.9725729823112488)]
[941.0, 749.0, ('OUTPUT', 0.9937143325805664)]
[436.0, 782.0, ('H g 7', 0.8870368003845215)]
[738.0, 784.0, ('10', 0.9980778694152832)]
[933.0, 818.0, ('ENABLE', 0.9980594515800476)]
[394.0, 886.0, ('GND 8', 0.9146793484687805)]
[814.0, 886.0, ('9QH', 0.9298664927482605)]

As can be seen, the results for the second image are not correct. Additionally, the OCR results seem to differ between machines. This is still a work in progress.

Thanks !!

s39674 commented 2 years ago

@ObaidAshraf Nice! That's looking great. (I apologize for not posting anything about this, I'm way too deep into PCB-CD right now.) Did you use EasyOCR for that? If so, I wonder how hard it is to extract the area of the IC pin assignment to feed into EasyOCR, since PDFs are quite hard to work with in Python.

Maybe you can use the x and y coordinates of the detected text boxes to crop the left and right columns of labels and then apply text detection on those cropped regions, and then concatenate them together? What do you think? (that would get rid of the pin numbers getting detected)
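
A rough sketch of a related way to use the coordinates without re-running detection: pair each detected pin number with the nearest label on the same row. The function name `pair_pins` and the `y_tolerance` value are assumptions for illustration only. The input rows are copied from the first result set above. This only works when numbers and labels are detected as separate boxes; it would not help with the second image, where tokens such as '16  Vcc' are fused.

```python
def pair_pins(detections, y_tolerance=10.0):
    """Map pin number -> label using [x, y, (text, confidence)] rows."""
    numbers = [d for d in detections if d[2][0].isdigit()]
    labels = [d for d in detections if not d[2][0].isdigit()]
    pins = {}
    for x, y, (text, _conf) in numbers:
        # Labels whose y coordinate is close enough to be on the same row.
        same_row = [lab for lab in labels if abs(lab[1] - y) <= y_tolerance]
        if not same_row:
            continue
        # The label horizontally closest to the number belongs to that pin.
        nearest = min(same_row, key=lambda lab: abs(lab[0] - x))
        pins[int(text)] = nearest[2][0]
    return pins


# First two rows of the first OCR result posted above.
detections = [
    [116.0, 172.0, ('VDD', 0.9834039211273193)],
    [292.0, 172.0, ('1', 0.9982110261917114)],
    [748.0, 170.0, ('14', 0.9994282722473145)],
    [900.0, 174.0, ('Vss', 0.961628258228302)],
    [126.0, 306.0, ('GPO', 0.9020509123802185)],
    [290.0, 304.0, ('2', 0.9986997842788696)],
    [748.0, 306.0, ('13', 0.9996862411499023)],
    [888.0, 308.0, ('D+', 0.9965775609016418)],
]

pairs = pair_pins(detections)
print(sorted(pairs.items()))
```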

Thank you so much for working on this!

ObaidAshraf commented 2 years ago

@s39674 I used the PaddleOCR package. It gives much better results than EasyOCR. Regarding the pin assignment area, it seems we would have to extract it manually from the datasheet and then feed it into the OCR engine. If you mean that, for the image below, we extract only the left and right labels to get the number of pins (counting the labels and avoiding the pin numbers): pins

I think we can give it a try, but it will be a bit more difficult, as we will have to somehow identify and crop only the left and right areas and leave out the middle. Additionally, I just found out that PaddleOCR works well with a GPU. On systems without a GPU it doesn't work that well, which seems to be another problem to resolve.

s39674 commented 2 years ago

@ObaidAshraf If we absolutely have to select that area manually, then I don't think it's worth it; using #22 would be a lot easier.

I can't get PaddleOCR to work so I don't have a proper example to show, but something like this should work:

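firstValidIndex = [458.0, 180.0, ('B1.', 0.8615524172782898)]
secondValidIndex = [814.0, 183.0, ('16  Vcc', 0.9195017218589783)]
offset = 10

# Cropping labels. Assumes img is the pin diagram as a NumPy array (e.g. from cv2.imread).
# The OCR coordinates are floats, so cast them to int before slicing.
leftPart = img[:, : int(firstValidIndex[0]) - offset]
rightPart = img[:, int(secondValidIndex[0]) + offset :]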
ObaidAshraf commented 2 years ago

@s39674 I get your point. By manually selecting the area, I meant fetching the pin diagram from the datasheet (using OpenCV etc.) and then using an OCR algorithm to extract the pin information. But it is also true that selecting the correct area that contains the pin information adds complexity.

s39674 commented 2 years ago

I understand. I just think that if we can't easily and efficiently convert the PDF to images with Python and process them with OCR, then this method would take too long for just one IC, let alone for every chip on a PCB. https://github.com/Belval/pdf2image looks like a good option, but it is quite memory intensive. I'm not sure what the best route is from here.
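
One possible way to keep memory down, sketched under assumptions: render only the pages whose extracted text mentions the pin diagram, one page at a time, instead of converting the whole datasheet. The helper names and the keyword list are hypothetical. `convert_from_bytes` with `dpi`, `first_page` and `last_page` is part of pdf2image, which also needs poppler installed. Whether the diagram's caption is extractable as text will depend on the datasheet.

```python
def pages_mentioning(page_texts, keywords=("PIN ASSIGNMENT", "PIN FUNCTION", "PINOUT")):
    """Return 1-indexed page numbers whose text contains any keyword."""
    hits = []
    for number, text in enumerate(page_texts, start=1):
        upper = (text or "").upper()
        if any(keyword in upper for keyword in keywords):
            hits.append(number)
    return hits


def render_pages(pdf_bytes, page_numbers, dpi=150):
    """Render only the given pages to PIL images, one page per call."""
    # Imported lazily so the text-only helper works without pdf2image.
    from pdf2image import convert_from_bytes

    images = {}
    for number in page_numbers:
        images[number] = convert_from_bytes(
            pdf_bytes, dpi=dpi, first_page=number, last_page=number
        )[0]
    return images


# Example with made-up page texts (in practice: page.extract_text() per page).
print(pages_mentioning(["Features", "Pin Assignment", None, "Package dimensions"]))
```

A lower `dpi` reduces memory further, at the cost of OCR accuracy on small labels.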

ObaidAshraf commented 2 years ago

@s39674 I think there is another solution that can give us the correct number of pins of an IC from just its model number. If we search for the part on any component search engine, for instance: https://www.sourcengine.com/

it gives us the package information (which is also available in the datasheet). The package information provides the exact number of pins of the IC. For example, when I search for MC74HC589A, I get the results below: image

Additionally, if we open the very first link, we get the "Number of terminals" information, as below: image

So, it seems we can get the required pin count from the website (or even the datasheet). We just need to fetch it somehow. I think the Octopart API also provides similar results. What do you think?
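
As an illustration of the package idea, here is a small sketch that reads the pin count from a package name. The function name, the family list and the example strings are assumptions, not values returned by any of the sites mentioned. It only trusts package families whose numeric suffix is conventionally the pin count, because names such as TO-220 or SOT-23 carry numbers that are not pin counts.

```python
import re

# Families where the trailing number is conventionally the pin count.
PIN_COUNT_FAMILIES = {
    "DIP", "PDIP", "SOIC", "SOP", "SSOP", "TSSOP", "QFN", "QFP", "TQFP", "LQFP",
}
PACKAGE = re.compile(r"^([A-Za-z]+)[-\s]?(\d{1,4})$")


def pin_count_from_package(package):
    """Return the pin count encoded in a package name, or None if unsure."""
    match = PACKAGE.match(package.strip())
    if not match:
        return None
    family, digits = match.group(1).upper(), match.group(2)
    if family not in PIN_COUNT_FAMILIES:
        return None
    return int(digits)


for name in ("SOIC-16", "pdip16", "TO-220", "SOT-23"):
    print(name, pin_count_from_package(name))
```

When the site exposes an explicit "Number of terminals" field, that is the safer source; the package name is only a fallback.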

Thanks !!

s39674 commented 2 years ago

Nice! That's very helpful. It looks like octopart also serves this info. I have added this functionality to TestingRequests.py. What do you think?

ObaidAshraf commented 2 years ago

@s39674 It would be a good idea to get this info directly from octopart. I believe it will save us the hassle of parsing datasheets or images to get the number of pins (or labels).

s39674 commented 2 years ago

What do you mean? I'm getting it from octopart: https://github.com/s39674/Image2schematic/blob/e5299e8a85a8ffabe94327f9ea38c594bdb6eaca/TestingRequest.py#L74 https://github.com/s39674/Image2schematic/blob/e5299e8a85a8ffabe94327f9ea38c594bdb6eaca/TestingRequest.py#L82-L93

The pdf stuff at the bottom is only there temporarily, I will delete it later.

ObaidAshraf commented 2 years ago

@s39674 The code snippet seems fine. Is it working as expected and are you getting correct number of pins?

s39674 commented 2 years ago

@ObaidAshraf yes, but now it returns a 403, requesting a captcha completion.

ObaidAshraf commented 2 years ago

@s39674 Ahh, it seems Octopart is using some anti-scraping mechanism. It shouldn't be a problem if you are using the Octopart API. A 403 status code means Forbidden. It can also happen when many requests are sent back-to-back without a delay.
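
A minimal sketch of spacing out retries when a 403 (or 429) comes back. The function name `get_with_backoff` and its parameters are hypothetical. The `get` argument is any callable that returns an object with a `status_code` attribute, for example `requests.get`. A captcha-backed 403 may not clear with retries alone, so this is a courtesy delay rather than a fix.

```python
import time


def get_with_backoff(get, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call get(url); on 403/429 wait and retry with doubling delays."""
    response = get(url)
    for attempt in range(retries):
        if response.status_code not in (403, 429):
            break
        # Wait 2 s, 4 s, 8 s, ... (with the default base_delay) before retrying.
        sleep(base_delay * 2 ** attempt)
        response = get(url)
    return response
```

With requests this could be called as `get_with_backoff(lambda u: requests.get(u, timeout=10), url)`. The last response is returned even if it is still a 403, so the caller should check `status_code`.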

s39674 commented 2 years ago

Hmm, it looks like changing the user agent does fool it. Anyway, I think this issue and #15 can be closed after merging the fetchViaHTTP branch, although I think we should query one more site as a fallback. Not sure where #22 is going. Let's focus on more important stuff: over on PCB-CD I will try to get a working BOM extractor by next week, and I would very much like your help. Thank you again for your work!

ObaidAshraf commented 2 years ago

@s39674 I am glad to help. Regarding the fallback site, we can use sourcengine.com, but we can keep that for later improvements. Let me know what is more important right now. Is it PCB-CD or something else?

Thanks !!

s39674 commented 2 years ago

Definitely PCB-CD. I feel like getting a BOM extractor would add a lot of value for this project and may be actually useful for someone.

s39674 commented 2 years ago

I'm closing this issue for now, as #24 provides a temporary fix for this. This issue may be reopened in the future if we can find an efficient way to obtain pin labels from the datasheet. Thank you for your support!