Closed wanghaisheng closed 5 years ago
Piotr Gurgul May 3, 2017
With Dropbox’s document scanner, a user can take a photo of a document with their phone and convert it into a clean, rectangular PDF. In our previous blog posts (Part 1, Part 2), we presented an overview of the document scanner’s machine learning backend, along with its iOS implementation. This post will describe some of the technical challenges associated with implementing the document scanner on Android.
We will specifically focus on all steps required to generate an augmented camera preview in order to achieve the following effect:
[Animated gif showing the live document preview in the Android doc scanner](https://dropboxtechblog.files.wordpress.com/2017/05/01-docscanner-cast.gif)
This requires custom interaction with the Android camera and access to individual preview frames.
Normally, when a third-party app requests a photo to be taken, it can be achieved easily in the following way:
Intent takePictureIntent = new Intent(MediaStore.ACTION_IMAGE_CAPTURE);
startActivityForResult(takePictureIntent, REQUEST_TAKE_PHOTO);
This delegates the task of taking a photo to the device’s native camera application. We receive the final image, with no control over intermediate steps.
However, we want to augment the live preview, detecting the document and displaying its edges. To do this, we need to create a custom camera application, processing each individual frame to find the edges, and drawing a blue quadrilateral that symbolizes the document’s boundaries in the live preview.
The whole cycle consists of the following steps:
[System diagram showing the main steps involved in displaying live previews of the detected document](https://dropboxtechblog.files.wordpress.com/2017/05/02-new_diagram-002.png)
Needless to say, steps (2) – (7) must take as little time as possible so that the movement of the blue quadrilateral appears to be smooth and remains responsive to camera movements.
It is believed that 10–12 frames per second is the minimum frequency required for the human brain to perceive motion. This means the whole cycle presented in the diagram should take no more than 80 ms. The Android hardware landscape is also very fragmented, which poses additional challenges. Cameras range from 0.3 to 24 megapixels, and unlike on iPhones, we can’t take the presence of any hardware feature (such as autofocus, a back-facing camera, or a physical flash LED) for granted. The code needs to defensively check whether each requested feature is there.
In the rest of the post, we’ll discuss each of the steps presented in the diagram.
The first step toward the augmented-reality preview is to create a custom camera preview without any augmented reality. To gain access to the device’s camera, we will be using the android.hardware.Camera object.
Note: The android.hardware.Camera API has been deprecated since version 5.0 (API level 21) and replaced with the much more powerful android.hardware.camera2 API. However, at the time of writing this post, roughly 50% of the active Android devices ran versions older than 5.0, so we were unable to avail of the improved camera API.
The very first step before starting the preview is to confirm that the device has a rear-facing camera. Unlike on iOS, we cannot assume this is true; the Nexus 7 tablet, for example, was equipped with a front-facing camera only.
We can perform such a check using the following snippet:
PackageManager pm = context.getPackageManager();
boolean hasRearCamera = pm.hasSystemFeature(PackageManager.FEATURE_CAMERA);
As per the documentation, PackageManager.FEATURE_CAMERA refers to the camera facing away from the screen. To check for the presence of a front camera, there is a separate flag available, FEATURE_CAMERA_FRONT. Hence, we are fine with the check above.
Tip: Accessing the device camera requires proper setup. This includes both declaring the camera-related features in AndroidManifest.xml:
<uses-feature android:name="android.hardware.camera" android:required="false" />
<uses-feature android:name="android.hardware.camera.autofocus" android:required="false" />
<uses-feature android:name="android.hardware.camera.flash" android:required="false" />
and requesting the android.permission.CAMERA permission at runtime so that it works on Android M and later versions.
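A minimal sketch of that runtime check, assuming an Activity and the support library’s compat helpers (the request code and the `startCameraPreview()` helper are ours, not from the original post):

```java
// Request the camera permission at runtime (Android M+); older versions grant it at install time.
private static final int REQUEST_CAMERA_PERMISSION = 1; // arbitrary request code

private void ensureCameraPermission() {
    if (ContextCompat.checkSelfPermission(this, Manifest.permission.CAMERA)
            != PackageManager.PERMISSION_GRANTED) {
        // Ask the user; the result arrives in onRequestPermissionsResult().
        ActivityCompat.requestPermissions(
                this, new String[] {Manifest.permission.CAMERA}, REQUEST_CAMERA_PERMISSION);
    } else {
        startCameraPreview(); // hypothetical helper: permission already granted
    }
}

@Override
public void onRequestPermissionsResult(int requestCode, String[] permissions, int[] grantResults) {
    if (requestCode == REQUEST_CAMERA_PERMISSION
            && grantResults.length > 0
            && grantResults[0] == PackageManager.PERMISSION_GRANTED) {
        startCameraPreview();
    }
}
```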
Another issue is that the camera sensor orientation can vary depending on the specific device. The most common one is landscape, but the so-called “reverse landscape” orientation used for the Nexus 5X camera sensor has caused a lot of problems for apps that were unprepared. It is very important to set the display orientation correctly so that the preview works properly regardless of the device’s specific setup. The snippet below shows how to do it.
private void setCorrectOrientation() {
    CameraInfo info = new CameraInfo();
    Camera.getCameraInfo(getBackCameraId(), info);
    // Current rotation of the screen relative to the device's natural orientation.
    int orientation = getWindowManager().getDefaultDisplay().getRotation();
    int degrees = 0;
    switch (orientation) {
        case Surface.ROTATION_0:
            degrees = 0;
            break;
        case Surface.ROTATION_90:
            degrees = 90;
            break;
        case Surface.ROTATION_180:
            degrees = 180;
            break;
        case Surface.ROTATION_270:
            degrees = 270;
            break;
        default:
            throw new RuntimeException("Unsupported display orientation");
    }
    // Compensate for the mounting angle of the back-facing camera sensor.
    mCamera.setDisplayOrientation((info.orientation - degrees + 360) % 360);
}
Another very important thing to remember is that, unlike on iOS, there are multiple potential aspect ratios to support. On some devices, the camera capture screen has buttons that float over the preview, while on others there is a dedicated panel holding all the controls.
[Camera capture screen on the Samsung Galaxy S5](https://dropboxtechblog.files.wordpress.com/2017/05/03-screenshot.jpg)
[Camera capture screen on the Xiaomi Mi4](https://dropboxtechblog.files.wordpress.com/2017/05/04-screenshot-1.jpg)
This is why we need to calculate the optimal preview size with the closest aspect ratio to our preview rectangle.
The camera parameters object has a method called mCamera.getParameters().getSupportedPreviewSizes()
that returns a list of preview dimensions supported by a given device. In order to find the best match, we iterate through the returned list and find the closest dimensions to the current preview size that match our aspect ratio (with some tolerance).
This way, the document scanner behaves correctly even when an unusual aspect ratio is needed, for example when operating in multi-window mode (shown below); the snippet after the screenshot sketches the selection logic.
[Document scanner in multi-window mode on Samsung Galaxy S6 (Android 7.0)](https://dropboxtechblog.files.wordpress.com/2017/05/05-multi-windowmode.jpg)
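A sketch of the selection logic described above (the tolerance value and the fallback strategy are illustrative, not necessarily what the Dropbox app does):

```java
// Picks the supported preview size whose aspect ratio is closest to the view's,
// preferring the candidate whose height is closest to the view height.
private Camera.Size getOptimalPreviewSize(List<Camera.Size> sizes, int viewWidth, int viewHeight) {
    final double ASPECT_TOLERANCE = 0.1; // illustrative tolerance
    double targetRatio = (double) viewWidth / viewHeight;
    Camera.Size optimal = null;
    double minDiff = Double.MAX_VALUE;
    for (Camera.Size size : sizes) {
        double ratio = (double) size.width / size.height;
        if (Math.abs(ratio - targetRatio) > ASPECT_TOLERANCE) {
            continue; // aspect ratio too far off
        }
        double diff = Math.abs(size.height - viewHeight);
        if (diff < minDiff) {
            optimal = size;
            minDiff = diff;
        }
    }
    // Fall back to the closest aspect ratio if nothing fits within the tolerance.
    if (optimal == null) {
        for (Camera.Size size : sizes) {
            double ratio = (double) size.width / size.height;
            double diff = Math.abs(ratio - targetRatio);
            if (diff < minDiff) {
                optimal = size;
                minDiff = diff;
            }
        }
    }
    return optimal;
}
```

The chosen size would then be applied via parameters.setPreviewSize(optimal.width, optimal.height).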
There are several ways in which camera sensor data can be tied to a UI component.
The oldest and arguably simplest way is using SurfaceView
as shown in an official Google API demo example.
However, SurfaceView comes with several limitations, as it is just a drawing surface placed behind the window that contains all the views. Two or more SurfaceViews cannot be reliably overlaid, which is problematic for augmented reality use cases such as the document scanner, as issues with z-ordering may arise (and these issues will likely be device-specific).
Another choice is a TextureView
which is a first-class citizen in the view hierarchy. This means it can be transformed, scaled and animated like any other view.
Once the camera object is acquired and parameters are set, we can start the preview by calling mCamera.startPreview()
.
Tip: It is very important to hold the camera object only while your app is in the foreground and to release it immediately in onPause(). Otherwise, the camera may become unavailable to other apps (or to our own app, if restarted).
In order to place UI components on top of the live preview, it’s best to use FrameLayout
. This way, vertical ordering will match the order in which components were defined in the layout file.
[(1) First, we define the TextureView](https://dropboxtechblog.files.wordpress.com/2017/05/06-screenshot.jpg)
[(2) On top of it, we place a custom view for drawing the quadrilateral](https://dropboxtechblog.files.wordpress.com/2017/05/07-screenshot-1.jpg)
[(3) As the last component, we define the layout containing the camera controls and the last gallery photo thumbnail](https://dropboxtechblog.files.wordpress.com/2017/05/08-screenshot-2.jpg)
<FrameLayout xmlns:android="http://schemas.android.com/apk/res/android"
    android:layout_width="match_parent"
    android:layout_height="match_parent">

    <!-- (1) Live camera preview -->
    <TextureView
        android:id="@+id/camera_preview"
        ... />

    <!-- (2) Custom view drawing the detected quadrilateral -->
    <QuadrilateralView
        android:id="@+id/quad_view"
        ... />

    <!-- (3) Camera controls and last gallery photo thumbnail -->
    <android.support.constraint.ConstraintLayout
        android:id="@+id/camera_controls">
        ...
    </android.support.constraint.ConstraintLayout>

</FrameLayout>
This assumes that a TextureView
is being used for the live preview. For SurfaceView
, z-order can be adjusted with the setZOrderMediaOverlay
method.
In order to improve the user experience in low-light conditions, we offer both torch and flash toggles. These can be enabled via the camera parameters Parameters.FLASH_MODE_TORCH and Parameters.FLASH_MODE_ON, respectively. However, many Android devices (most commonly tablets) don’t have a physical LED flash, so we need to check for its presence before displaying the flash and torch icons. Once the user taps the torch or flash icon, we change the flash mode by calling setFlashMode() on the camera parameters and applying them with mCamera.setParameters().
It is important to remember that before changing camera parameters, we need to stop the preview, using mCamera.stopPreview()
, and start it again when we are done, using mCamera.startPreview()
. Not doing this can result in undefined behavior on some devices.
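For example, toggling the torch under that rule might look like this (a sketch; it assumes the flash presence check has already passed):

```java
// Toggles between torch and off; the preview is stopped before touching parameters,
// since some devices misbehave otherwise.
private void setTorchEnabled(boolean enabled) {
    mCamera.stopPreview();
    Camera.Parameters parameters = mCamera.getParameters();
    parameters.setFlashMode(enabled
            ? Camera.Parameters.FLASH_MODE_TORCH
            : Camera.Parameters.FLASH_MODE_OFF);
    mCamera.setParameters(parameters);
    mCamera.startPreview();
}
```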
On devices that support it, we use FOCUS_MODE_CONTINUOUS_PICTURE to make the camera refocus on the subject very aggressively in order to keep it sharp at all times. On devices that don’t support it, this can be emulated by requesting autofocus manually on each camera movement, which in turn can be detected using the accelerometer. The supported focus modes can be obtained by calling mCamera.getParameters().getSupportedFocusModes().
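A sketch of how that choice might look (not the exact Dropbox code):

```java
// Prefer continuous focus when the device supports it; otherwise fall back to plain
// autofocus, triggered manually when the accelerometer reports camera movement.
Camera.Parameters parameters = mCamera.getParameters();
List<String> focusModes = parameters.getSupportedFocusModes();
if (focusModes.contains(Camera.Parameters.FOCUS_MODE_CONTINUOUS_PICTURE)) {
    parameters.setFocusMode(Camera.Parameters.FOCUS_MODE_CONTINUOUS_PICTURE);
    mCamera.setParameters(parameters);
} else if (focusModes.contains(Camera.Parameters.FOCUS_MODE_AUTO)) {
    parameters.setFocusMode(Camera.Parameters.FOCUS_MODE_AUTO);
    mCamera.setParameters(parameters);
    // Later, call mCamera.autoFocus(callback) whenever movement is detected.
}
```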
In order to receive a callback each time a new frame is available, we need to register a listener.
For TextureView, we can do this by passing our listener to mTextureView.setSurfaceTextureListener().
Depending on whether a SurfaceView or a TextureView has been used, the corresponding callback is either Camera.PreviewCallback, whose onPreviewFrame(byte[] data, Camera camera) is invoked each time a new frame is available, or TextureView.SurfaceTextureListener, with its onSurfaceTextureUpdated(SurfaceTexture surface) method.
Once a SurfaceView
or TextureView
is tied to the camera object, we can start preview by calling mCamera.startPreview()
.
Every time a new frame is available (for most devices, it occurs 20-30 times per second), the callback is invoked.
When onPreviewFrame(byte[] data, Camera camera) is used to listen for new frames, it’s important to remember that the next frame will not arrive until we call camera.addCallbackBuffer(mPreviewBuffer) to signal that we are done processing the buffer and the camera is free to write to it again.
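Putting those pieces together, a minimal sketch of the buffered callback setup might look like this (processFrame() is a hypothetical helper that runs the detector):

```java
// Allocate one reusable buffer sized for the NV21 preview format and register a
// callback that hands the buffer back to the camera once the frame is processed.
Camera.Parameters parameters = mCamera.getParameters();
Camera.Size previewSize = parameters.getPreviewSize();
int bufferSize = previewSize.width * previewSize.height
        * ImageFormat.getBitsPerPixel(ImageFormat.NV21) / 8;
final byte[] mPreviewBuffer = new byte[bufferSize];

mCamera.addCallbackBuffer(mPreviewBuffer);
mCamera.setPreviewCallbackWithBuffer(new Camera.PreviewCallback() {
    @Override
    public void onPreviewFrame(byte[] data, Camera camera) {
        processFrame(data); // hypothetical helper
        // Return the buffer so the camera can deliver the next frame.
        camera.addCallbackBuffer(mPreviewBuffer);
    }
});
```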
If we use SurfaceTexture callbacks to receive new frames, onSurfaceTextureUpdated will be invoked every time a new frame is available, and it is up to us whether it should be processed or discarded.
Our document detector, described in the previous blog posts, requires the frame that is later passed to C++ code to have specific dimensions and a specific color space: a 200 x 200 px frame in the RGBA color space. For onPreviewFrame(byte[] data, Camera camera), the data byte array is usually in NV21 format, which is the standard for Android camera previews.
This NV21 frame can be converted to an RGBA bitmap using the following code:
Camera.Parameters parameters = camera.getParameters();
int width = parameters.getPreviewSize().width;
int height = parameters.getPreviewSize().height;
// Wrap the NV21 preview bytes, compress to JPEG, and decode back into a Bitmap.
YuvImage yuv = new YuvImage(data, parameters.getPreviewFormat(), width, height, null);
ByteArrayOutputStream out = new ByteArrayOutputStream();
yuv.compressToJpeg(new Rect(0, 0, width, height), 100, out);
byte[] bytes = out.toByteArray();
Bitmap bitmap = BitmapFactory.decodeByteArray(bytes, 0, bytes.length);
The bad news is that, using this method, it takes 300–500 ms to process a 1920 x 1080 frame, which is absolutely unacceptable for real-time applications.
Fortunately, there are several ways to do this conversion much faster, such as using OpenGL/OpenCV or native code. However, two RenderScript intrinsics can provide the required functionality without having to drop down to lower-level APIs — ScriptIntrinsicResize combined with ScriptIntrinsicYuvToRGB. By applying these two, we were able to get the processing time down to 10–25 ms thanks to hardware acceleration.
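A sketch of how the two intrinsics can be chained (allocation setup simplified; `context`, `data`, `width`, and `height` are assumed to be available, and on pre-Lollipop devices the RenderScript support library provides the same classes):

```java
// Converts an NV21 preview frame to RGBA and downscales it to 200x200 using the
// hardware-accelerated RenderScript intrinsics mentioned above.
RenderScript rs = RenderScript.create(context);
ScriptIntrinsicYuvToRGB yuvToRgb = ScriptIntrinsicYuvToRGB.create(rs, Element.U8_4(rs));
ScriptIntrinsicResize resize = ScriptIntrinsicResize.create(rs);

// Input allocation holding the raw NV21 bytes.
Type.Builder yuvType = new Type.Builder(rs, Element.U8(rs)).setX(data.length);
Allocation yuvAlloc = Allocation.createTyped(rs, yuvType.create(), Allocation.USAGE_SCRIPT);

// Full-size RGBA allocation, then a small 200x200 one for the detector.
Type.Builder rgbaType = new Type.Builder(rs, Element.RGBA_8888(rs)).setX(width).setY(height);
Allocation rgbaAlloc = Allocation.createTyped(rs, rgbaType.create(), Allocation.USAGE_SCRIPT);
Type.Builder smallType = new Type.Builder(rs, Element.RGBA_8888(rs)).setX(200).setY(200);
Allocation smallAlloc = Allocation.createTyped(rs, smallType.create(), Allocation.USAGE_SCRIPT);

yuvAlloc.copyFrom(data);
yuvToRgb.setInput(yuvAlloc);
yuvToRgb.forEach(rgbaAlloc);        // NV21 -> RGBA
resize.setInput(rgbaAlloc);
resize.forEach_bicubic(smallAlloc); // downscale to 200x200

Bitmap small = Bitmap.createBitmap(200, 200, Bitmap.Config.ARGB_8888);
smallAlloc.copyTo(small);
```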
Things look much simpler when the preview is implemented using TextureView
and onSurfaceTextureUpdated(SurfaceTexture surface)
callback.
This way, we can get the bitmap straight from the TextureView
once a new frame is available:
int expectedImageWidth = pageDetector.getExpectedImageWidth();
int expectedImageHeight = pageDetector.getExpectedImageHeight();
Bitmap bitmap = mTextureView.getBitmap(expectedImageWidth, expectedImageHeight);
TextureView#getBitmap
is generally known to be slow; however, when the dimensions of the requested bitmap are small enough, the processing time is very reasonable (5-15ms for our 200×200 case). While this isn’t a universal solution, it turned out to be both the fastest and the simplest for our application.
Moreover, as we mentioned earlier, the camera sensor orientation is usually either landscape (90 deg) or reverse landscape (270 deg), so the bitmap will most likely be rotated. However, instead of rotating the whole bitmap, it is much faster to rotate the quadrilateral returned by the document detector.
On top of the scaled bitmap, our document detector requires a so-called rotation matrix. Such a matrix essentially describes the direction in which the phone has moved (for example, tilting). Knowing the coordinates of the quadrilateral at a given time and the direction in which the device was moved, the document detector can estimate the quadrilateral’s anticipated future position, which speeds up the computation.
In order to calculate the rotation matrix, we need to listen for two types of sensor events — Sensor.TYPE_MAGNETIC_FIELD and Sensor.TYPE_ACCELEROMETER, which represent the magnetic and gravity data. Having these, the rotation matrix can be obtained by calling SensorManager.getRotationMatrix(). The document detector is written in C++, hence we make the call using JNI.
In case we cannot obtain sensor data, we pass an identity matrix.
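A minimal sketch of that bookkeeping, assuming the activity implements SensorEventListener:

```java
// Keep the latest accelerometer and magnetometer readings and combine them into
// a rotation matrix before each call into the native detector.
private final float[] mGravity = new float[3];
private final float[] mGeomagnetic = new float[3];
private final float[] mRotationMatrix = new float[9];

@Override
public void onSensorChanged(SensorEvent event) {
    if (event.sensor.getType() == Sensor.TYPE_ACCELEROMETER) {
        System.arraycopy(event.values, 0, mGravity, 0, 3);
    } else if (event.sensor.getType() == Sensor.TYPE_MAGNETIC_FIELD) {
        System.arraycopy(event.values, 0, mGeomagnetic, 0, 3);
    }
}

private float[] currentRotationMatrix() {
    boolean ok = SensorManager.getRotationMatrix(mRotationMatrix, null, mGravity, mGeomagnetic);
    if (!ok) {
        // No usable sensor data: fall back to the identity matrix, as described above.
        return new float[] {1, 0, 0, 0, 1, 0, 0, 0, 1};
    }
    return mRotationMatrix;
}
```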
Tip: Since calls to the detector can take anywhere from 20 to 100 ms depending on the Android device, they cannot be executed on the UI thread. We run them sequentially in a separate thread with elevated priority.
Once the call to the document detector returns, we receive the coordinates of the four points of the quadrilateral that delimits the document edges. Understandably, these coordinates apply to the frame that was passed to the detector (e.g., the 200×200 square mentioned above), so we need to scale them to the original size of the preview. We also need to rotate the quadrilateral if the camera orientation doesn’t match the orientation of the preview (see step (4), Converting frames, above).
Having received the frame coordinates, it is time to draw the quadrilateral over the camera preview (yet below the camera controls). For simplicity and better control over z-ordering, we decided to create a custom View with an overridden onDraw() method that is responsible for drawing the quad on the canvas. Starting from Android 4.0 (Ice Cream Sandwich), drawing on a canvas is hardware-accelerated by default, which greatly improves performance.
Each time we receive an updated frame, we need to call invalidate() on the View. The downside of this approach is that we have no control over the real refresh rate. To be precise, we don’t know how much time will elapse between our call to invalidate() and the OS invoking onDraw() on our view. However, we have measured that this approach allows us to achieve at least 15 FPS on most devices.
Tip: When implementing a custom view, it is very important to keep the onDraw() method as lightweight as possible and avoid any expensive operations, such as new object creation.
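For illustration, a minimal version of such a view might look like the following (the drawing details are ours, not Dropbox’s exact implementation; QuadrilateralView is the name used in the layout above):

```java
// Draws the latest detected quadrilateral; the Paint and Path are allocated once,
// keeping onDraw() free of object creation.
public class QuadrilateralView extends View {
    private final Paint mPaint = new Paint(Paint.ANTI_ALIAS_FLAG);
    private final Path mPath = new Path();
    private PointF[] mCorners; // scaled/rotated corners in view coordinates

    public QuadrilateralView(Context context, AttributeSet attrs) {
        super(context, attrs);
        mPaint.setColor(Color.BLUE);
        mPaint.setStyle(Paint.Style.STROKE);
        mPaint.setStrokeWidth(8f);
    }

    public void setCorners(PointF[] corners) {
        mCorners = corners;
        invalidate(); // schedules onDraw() on the UI thread
    }

    @Override
    protected void onDraw(Canvas canvas) {
        if (mCorners == null) {
            return;
        }
        mPath.rewind();
        mPath.moveTo(mCorners[0].x, mCorners[0].y);
        for (int i = 1; i < mCorners.length; i++) {
            mPath.lineTo(mCorners[i].x, mCorners[i].y);
        }
        mPath.close();
        canvas.drawPath(mPath, mPaint);
    }
}
```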
If drawing using a custom view is too slow, there are many faster, yet more complex solutions such as having another TextureView
or leveraging OpenGL.
We measured the time consumed by each step (in milliseconds) on several Android devices. In each case, the Dropbox app was the only non-preinstalled app. However, since there are many different factors that influence the performance (e.g. phone movements), these results cannot be treated as a benchmark and are here solely for illustrative purposes.
[Timings for one full cycle of the preview process on various devices](https://dropboxtechblog.files.wordpress.com/2017/05/09-file.png)
Note that faster devices usually have better cameras, so there is also more data to process. The worst case scenario for the document scanner would be a slow device with a very high resolution camera.
The thumbnail we display in the lower left corner allows a user to preview the last gallery item. Tapping on it takes the user to the phone’s camera roll, where an existing photo can be selected for scanning.
[Using an existing photo in the doc scanner](https://dropboxtechblog.files.wordpress.com/2017/05/10-gallery_scan_final.gif)
The last available thumbnail (if any) can be retrieved using the following query:
String[] projection =
new String[] {
ImageColumns._ID, ImageColumns.DATA, ImageColumns.DATE_TAKEN,
};
Cursor cursor =
getContentResolver()
.query(
MediaStore.Images.Media.EXTERNAL_CONTENT_URI,
projection,
null,
null,
ImageColumns.DATE_TAKEN + " DESC");
Tip: To ensure proper orientation of the thumbnail (and of the full-size photo), we need to read and interpret its EXIF tags correctly. This can be achieved using the android.media.ExifInterface class. There are 8 different orientation values that need to be interpreted.
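A sketch of that orientation handling (error handling omitted; only the pure rotations are shown, and the mirrored variants would be handled analogously; `photoPath` and `bitmap` are assumed to come from the gallery query above):

```java
// Read the EXIF orientation of the gallery photo and map it to a rotation in degrees.
ExifInterface exif = new ExifInterface(photoPath);
int orientation = exif.getAttributeInt(
        ExifInterface.TAG_ORIENTATION, ExifInterface.ORIENTATION_NORMAL);
int rotationDegrees;
switch (orientation) {
    case ExifInterface.ORIENTATION_ROTATE_90:
        rotationDegrees = 90;
        break;
    case ExifInterface.ORIENTATION_ROTATE_180:
        rotationDegrees = 180;
        break;
    case ExifInterface.ORIENTATION_ROTATE_270:
        rotationDegrees = 270;
        break;
    default:
        rotationDegrees = 0;
}
// Apply the rotation so the thumbnail is displayed upright.
Matrix matrix = new Matrix();
matrix.postRotate(rotationDegrees);
Bitmap oriented = Bitmap.createBitmap(
        bitmap, 0, 0, bitmap.getWidth(), bitmap.getHeight(), matrix, true);
```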
If the cursor is empty (there are no photos in the gallery yet) or retrieving the bitmap fails (a null bitmap or an exception), we simply hide the preview and make scanning from the gallery unavailable.
Try out the Android Dropbox doc scanner today, and stay tuned for a future doc scanner post where we will describe the challenges in creating a multi-page PDF from a set of captured pages.
https://blogs.dropbox.com/tech/2016/08/fast-document-rectification-and-enhancement/
Jongmin Baek August 16, 2016
Dropbox’s document scanner lets users capture a photo of a document with their phone and convert it into a clean, rectangular PDF. It works even if the input is rotated, slightly crumpled, or partially in shadow—but how?
In our previous blog post, we explained how we detect the boundaries of the document. In this post, we cover the next parts of the pipeline: rectifying the document (turning it from a general quadrilateral to a rectangle) and enhancing it to make it evenly illuminated with high contrast. In a traditional flatbed scanner, you get all of these for free, since the document capture environment is tightly controlled: the document is firmly pressed against a brightly-lit rectangular screen. However, when the camera and document can both move freely, this becomes a much tougher problem.
We would like our scans to be easy to read, no matter the conditions in which they were captured. We define a pleasing scan to have the following properties: the document should fill the output image and be rectangular, the background should be mostly white and evenly illuminated, and the foreground text and figures should remain crisp and high-contrast.
Here’s an example input and output:
We assume that the input document is rectangular in the physical world, but if it is not exactly facing the camera, the resulting corners in the image will be a general convex quadrilateral. So to satisfy our first goal, we must undo the geometric transform applied by the capture process. This transformation depends on the viewpoint of the camera relative to the document (these are the so-called extrinsic parameters), in addition to things like the focal length of the camera (the intrinsic parameters). Here’s a diagram of the capture scenario:
In order to undo the geometric transform, we must first determine the said parameters. If we assume a nicely symmetric camera (no astigmatism, no skew, et cetera), the unknowns in this model are the 3D location and 3D orientation of the camera relative to the document (six values), the physical dimensions of the document (two values), and the focal length of the camera (one value).
On the flip side, the x- and y-coordinates of the four detected document corners give us effectively eight constraints. While there are seemingly more unknowns (9) than constraints (8), the unknowns are not entirely free variables—one could imagine scaling the document physically and placing it further from the camera, to obtain an identical photo. This relation places an additional constraint, so we have a fully constrained system to be solved. (The actual system of equations we solve involves a few other considerations; the relevant Wikipedia article gives a good summary.)
Once the parameters have been recovered, we can undo the geometric transform applied by the capture process to obtain a nice rectangular image. However, this is potentially a time-consuming process: one would look up, for each output pixel, the value of the corresponding input pixel in the source image. Of course, GPUs are specifically designed for tasks like this: rendering a texture in a virtual space. There exists a view transform—which happens to be the inverse of the camera transform we just solved for!—with which one can render the full input image and obtain the rectified document. (An easy way to see this is to note that once you have the full input image on the screen of your phone, you can tilt and translate the phone such that the projection of the document region on the screen appears rectilinear to you.)
Lastly, recall that there was an ambiguity with respect to scale: we can’t tell whether the document was a letter-sized paper (8.5” x 11”) or a poster board (17” x 22”), for instance. What should the dimensions of the output image be? To resolve this ambiguity, we count the number of pixels within the quadrilateral in the input image, and set the output resolution as to match this pixel count. The idea is that we don’t want to upsample or downsample the image too much.
Once we have a rectangular rendering of the document, the next step is to give it a clean and crisp scanned appearance. We can explicitly formulate this as an optimization problem; that is, we solve for the final output image J(x,y) as a function of the input image I(x, y) that satisfies the two aforementioned requirements to the greatest extent possible:
If we could tell whether a given pixel belongs to the foreground or to the background, this task would be straightforward. However, assigning a binary label leads to aliasing, especially for text with small font. A simple linear transform based on the pixel value is not sufficient, either, because there are often shadows or other lighting variations across the image. Hence, we will try to compute the final output image J without explicitly solving the foreground/background classification problem.
We achieve the above requirements by writing a cost function that penalizes things we don’t want, and then running it through a standard optimization procedure to arrive at the solution with the lowest cost possible; hopefully, this will correspond to the best possible output image.
So now, the degree to which a potential solution J adheres to the first requirement is fairly straightforward to write as a cost:
where 255 denotes white pixels and the indices x, y range over the extent of the image. If the output image is mostly white, this measure would be minimized.
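The cost itself appeared as an image in the original post; based on the surrounding description, it can be reconstructed as something like:

```latex
L_{\text{background}}(J) = \sum_{x,y} \bigl(J(x,y) - 255\bigr)^2
```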
For the second requirement, we’d like to ensure that the foreground has a crisp contrast against the background for ease of reading, despite changes in brightness throughout the image. Since we are not explicitly assigning foreground labels, what we need is a way to preserve local structure while factoring out global brightness changes. One common measure of the local structure within an image is its gradient, which denotes the difference between neighboring pixels. Hence, to preserve the local structure, we can use as our cost the degree to which the output gradient deviates from that of the original image:
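Again reconstructing the missing formula from the description, the gradient-preservation cost can be written as:

```latex
L_{\text{structure}}(J) = \sum_{x,y} \bigl\lVert \nabla J(x,y) - \nabla I(x,y) \bigr\rVert^2
```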
Combining the two, we obtain an optimization problem we can tackle:
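A plausible form of the combined objective, using the weighting coefficients k1 and k2 that the post refers to later:

```latex
\min_{J} \;\; k_1 \sum_{x,y} \bigl(J(x,y) - 255\bigr)^2
        \;+\; k_2 \sum_{x,y} \bigl\lVert \nabla J(x,y) - \nabla I(x,y) \bigr\rVert^2
```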
This yields a well-known system of equations called Poisson’s equation, which is commonly used in computer graphics and physics. It can be solved efficiently via either conjugate gradient descent or the fast Fourier transform. We make use of Apple’s Accelerate framework and open-source template meta-programming libraries, such as Eigen and our own Lopper, for further accelerating the computation.
Solving Poisson’s equation on a full-resolution image (8–12 megapixels on the latest iPhones) is still computationally demanding, and can take several seconds on older devices. If the user is creating a multi-page PDF, the wait time increases commensurately. To provide a smoother user experience, we would like to reduce the processing time by an order of magnitude.
One observation is that the output is generally linearly correlated with the input, at least locally—if one were to apply some gain a to the input and add an offset b (that is, J ≈ a·I + b), it would be a reasonably good solution locally.
Of course, the rationale behind using a mathematical machinery like the Poisson’s equation in the first place was that there is no single gain and offset that works for the whole image. In order to handle uneven illuminations and shadows, however, we could allow the gain and the offset to vary across the image:
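In equation form (reconstructed from the description; a and b denote the spatially-varying gain and offset):

```latex
J(x, y) \;\approx\; a(x, y)\, I(x, y) + b(x, y)
```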
While this new formulation is more flexible than before, it has twice as many unknowns (the gain and the offset at each pixel, rather than simply the final output value), making it trickier and more expensive to solve.
The key insight for reducing the computational cost and further constraining the problem is that the gain and the offset should vary relatively slowly across the image—we’re aiming to deal with illumination changes, not rainbow-colored paper! This allows us to solve the optimization problem at a much lower resolution compared to the input image, and therefore much faster. This also implicitly forces the unknown values to correlate locally, because we then upsample the gain and offset back to the original resolution. Once the gain and the offset are known across the image, we can plug them back into the above equation to obtain the final output image.
So far our derivations have ignored color, even though most photos come in the form of an RGB image. The simplest way to deal with this is to apply the above algorithm to each of the R, G, B channels independently, but this can result in color shifts, since the channels are no longer constrained together.
In our initial effort to combat this, we tried to substitute in the original RGB values for the output pixels that are not close to white. However, when we tried this, we encountered the effect of color constancy. Here’s a great illustration of this “illusion,” in which the two tiles marked A and B have the same pixel values, but appear to be very different:
[Checker shadow illusion (Wikimedia Commons)](https://commons.wikimedia.org/wiki/File:Grey_square_optical_illusion.PNG)
In our experiments, the output colors using this simple algorithm would look faded, even though the RGB values were exactly the same as the input! The reason is that the human visual system is based on relative brightness, not absolute ones; this makes colors “pop” more relative to the dull gray of the input, but not relative to the bright white background of the enhanced image. To deal with this effect, our algorithm converts the image into the HSV color space, and then copies the hue and saturation from the original image wherever appropriate. This results in much better color constancy, as seen below:
Left: the original image. Middle: an enhanced image, with the background becoming white and the foreground preserved in exact R, G, B values. Note that the colors appear faded. Right: an enhanced image that tries to correct for the perceptual discrepancy.
So far we have assumed that our document detection and rectification steps (which precede enhancement) work perfectly. While they provide reasonable results on most inputs, sometimes they have to operate on input images that violate our assumptions: In practice, many documents being scanned have edges that are not perfectly straight, and sometimes the corners are dog-eared. As a consequence, the rectified image may include some background that is not a part of the document. We detect and segment out these areas via a simple min-cut algorithm, so that the final output is more reflective of the user intent; the algorithm removes well-delineated dark areas near the borders.
Left: an enhanced document image, without treating the boundaries. Because the document edges may be physically curved or dog-eared, the rectified image may contain some non-document regions. Right: an enhanced document image, with the boundaries treated.
Starting from a document boundary, we have shown how we rectify the image and then enhance it for readability. Our enhancement algorithm also contains a single free parameter that controls the contrast of the output document (plus the relative values of the coefficients introduced above, like k1 and k2), and we make it adjustable with a user-friendly interface so that you can get the exact appearance you want, as shown here:
Try out the Dropbox document scanner today, and stay tuned for our next blog post.
https://blogs.dropbox.com/tech/2016/08/fast-and-accurate-document-detection-for-scanning/
Ying Xiong August 9, 2016
A few weeks ago, Dropbox launched a set of new productivity tools including document scanning on iOS. This new feature allows users to scan documents with their smartphone camera and store those scans directly in their Dropbox. The feature automatically detects the document in the frame, extracts it from the background, fits it to a rectangular shape, removes shadows and adjusts the contrast, and finally saves it to a PDF file. For Dropbox Business users, we also run Optical Character Recognition (OCR) to recognize the text in the document for search and copy-pasting.
Beginning today, we will present a series of technical blog posts describing the computer vision and machine learning technologies that make Dropbox’s document scanning possible. In this post, we’ll focus on the first part of the pipeline: document detection.
The goal of document detection is to find the corners and edges of a document in the image, so that it can be cropped out from the background. Ideally, detection should happen in real time, so that the user can interactively move the camera to capture the best image possible. This requires the detector to run really fast (100ms per frame or less) on a tight CPU and memory budget.
A common approach to solving problems like this is to train a deep neural network (DNN). DNNs are algorithms that take a large amount of labeled data and automatically learn to predict labels for new inputs. These have proved to be tremendously successful for a variety of computer vision applications, including image classification, image captioning, and face detection. However, DNNs are quite expensive, both in terms of computation time and memory usage. Therefore, they are usually difficult to deploy on mobile devices.
Another potential solution is to use Apple’s rectangle detection SDK, which provides an easy-to-use API that can identify rectangles in still images or video sequences in near-realtime. The algorithm works very well in simple scenes with a single prominent rectangle in a clean background, but is less accurate in more complicated scenes, such as capturing small receipts or business cards in cluttered backgrounds, which are essential use-cases for our scanning feature.
We decided to develop a customized computer vision algorithm that relies on a series of well-studied fundamental components, rather than the “black box” of machine learning algorithms such as DNNs. The advantages of this approach are that it is easier to understand and debug, needs much less labeled training data, runs very fast and uses less memory at run time. It is also more accurate than Apple’s SDK for the kinds of usage scenarios we care about; in an A/B test evaluation, the detections found by our algorithm are 60% less likely to be manually corrected by users than those found by Apple’s API.
Our first observation is that documents are usually rectangular-shaped in physical space, and turn into convex quadrilaterals when projected onto 2D images. Therefore, our goal becomes finding the “best” quadrilateral in the image and using that as our proxy for the document boundary. In order to find the quadrilateral, we need to find straight lines and their intersections. Finally, to find straight lines, we need to detect strong edges in the image. This gives us the outline of our detection algorithm, as shown below. We will discuss each component in more detail next.
Document detection pipeline
Finding edges in an image is a classic problem in image processing and computer vision. It has decades of history, and saw early success already in the ’80s. One of the best known methods is the Canny edge detector, named after its inventor, John Canny. It dates back to 1986 but is still widely used today.
We applied the Canny Detector to our input image, as shown below, but the results were not very promising. The main problem is that the sections of text inside the document are strongly amplified, whereas the document edges—what we’re interested in—show up very weakly.
Left: the input image. Right: the output of the Canny edge detector.
To overcome these shortcomings, we used a modern machine learning-based algorithm. The algorithm is trained on images where humans annotate the most significant edges and object boundaries. Given this labeled dataset, a machine learning model is trained to predict the probability of each pixel in an image belonging to an object boundary.
The result of this learning-based edge detector is shown below. It’s much better at focusing on the document edges that we care about.
Left: the input image. Right: the output of the machine learning-based edge detector.
Once we have an accurate edge map, we’d like to find straight lines in it. For this, we use the venerable Hough transform, a technique that lets individual data points “vote” for likely solutions to a set of equations. In our case, each detected edge pixel votes for all lines passing through that point; the hope is that by adding up the votes across all edge pixels, the true document boundaries will emerge with the most votes.
More formally, here’s how it works: The slope-intercept form of a line is y = mx + b. If we detect an edge pixel at a particular (x,y) point, we want to vote for all lines that pass through the point. This corresponds to all slopes m and intercepts b that satisfy the line equation for that point. So we set up a “Hough Space” with m and b axes. Here, a single point (m,b) corresponds to a line in the original image; conversely, a point in the original image space corresponds to a line in the Hough Space. (This is called a duality in mathematics.) For every edge pixel in the original image, we increment a count for all corresponding points in the Hough Space. Finally, we simply look for the points with most votes in the Hough Space, and convert those back into lines in the original space.
In the figure below, you can see the detected edge pixels on the left and the corresponding Hough Space in the middle. We’ve circled the points with the most votes in the Hough Space, and then converted them back into lines (overlaid onto the original image) on the right. Note that although we described the Hough Transform above in terms of the slope-intercept form of a line, in practice we use a polar parameterization, r=x·sinθ+y·cosθ, that is more robust and easier to work with.
Left: detected edges. Middle: the Hough Transform of the edges, with local maxima marked in red. Right: the lines corresponding to the local maxima overlaid onto the original image.
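To make the voting procedure concrete, here is a compact sketch in Java (the bin counts, the 0.5 threshold, and the `edgeMap` array of per-pixel edge probabilities are illustrative assumptions, not the actual implementation):

```java
// Accumulates Hough votes for strong edge pixels using the polar form
// r = x*sin(theta) + y*cos(theta) mentioned above.
int numThetaBins = 180;
int maxR = (int) Math.ceil(Math.hypot(width, height));
int[][] accumulator = new int[numThetaBins][2 * maxR + 1];

for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        if (edgeMap[y][x] < 0.5f) {
            continue; // not a strong edge pixel
        }
        for (int t = 0; t < numThetaBins; t++) {
            double theta = Math.PI * t / numThetaBins;
            int r = (int) Math.round(x * Math.sin(theta) + y * Math.cos(theta));
            accumulator[t][r + maxR]++; // offset so negative r fits in the array
        }
    }
}
// Local maxima of `accumulator` correspond to candidate document edge lines.
```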
After finding straight lines, the rest of the work is relatively simple. We compute the intersections between the lines as potential document corners, applying some simple geometric constraints. For example, intersections with very acute angles are unlikely to be document corners. We next iterate through the potential document corners and enumerate all possible quadrilaterals, each of which is scored by adding up the probability predicted by the edge detector over the pixels along its perimeter. The quadrilateral with the highest score is output as the detected document.
Left: intersections of detected lines are potential document corners, although the red ones are filtered out by using geometric constraints. Middle: one possible quadrilateral formed by the potential corners. Right: the quadrilateral with the highest score, which is the output of our algorithm.
Finally, we show a video below demonstrating each step of the pipeline. The video is generated with a standalone iOS app we built to develop, visualize and debug our algorithm. The full pipeline runs near realtime at about 8–10 frames per second.
Visualization of all steps in the detection algorithm.
Try out the Dropbox doc scanner today, and stay tuned for our next blog post, where we’ll describe how we turn the detected document outline into an enhanced rectangular image.
https://blogs.dropbox.com/tech/2016/10/improving-the-responsiveness-of-the-document-detector/
Jongmin Baek October 19, 2016
In our previous blog posts (Part 1, Part 2), we presented an overview of various parts of Dropbox’s document scanner, which helps users digitize their physical documents by automatically detecting them from photos and enhancing them. In this post, we will delve into the problem of maintaining a real-time frame rate in the document scanner even in the presence of camera movement, and share some lessons learned.
Dropbox’s document scanner shows an overlay of the detected document over the incoming image stream from the camera. In some sense, this is a rudimentary form of augmented reality. Of course, this isn’t revolutionary; many apps have the same form of visualization. For instance, many camera apps will show a bounding box around detected faces; other apps show the world through a color filter, a virtual picture frame, geometric distortions, and so on.
One constraint is that the necessary processing (e.g., detecting documents, detecting and recognizing faces, localizing and classifying objects, and so on) does not happen instantaneously. In fact, the fancier one’s algorithm is, the more computations it needs and the slower it gets. On the other hand, the camera pumps out images at 30 frames per second (fps) continuously, and it can be difficult to keep up. Exacerbating this is the fact that not everyone is sporting the latest, shiniest flagship device; algorithms that run briskly on the new iPhone 7 will be sluggish on an iPhone 5.
We ran into this very issue ourselves: the document detection algorithm described in our earlier blog post could run in real-time on the more recent iPhones, but struggled on older devices, even after leveraging vectorization (performing many operations simultaneously using specialized hardware instructions) and GPGPU (offloading some computations to the graphics processor available on phones). In the remaining sections, we discuss various approaches for reconciling the slowness of algorithms with the frame rate of the incoming images.
Let’s assume from here on that our document detection algorithm requires 100ms per frame on a particular device, and the camera yields an image every 33 ms (i.e., 30 fps). One straightforward approach is to run the algorithm on a “best effort” basis while displaying all the images, as shown in the diagram below.
The diagram shows the relative timings of various events associated with a particular image from the camera, corresponding to the “Capture Event” marked in gray. As you can see, the image is displayed for 33 ms (“Image Display”) until the next image arrives. Once the document boundary quadrilateral is detected (“Quad Detection”), which happens 100 ms after the image is received, the detected quad is displayed for the next 100 ms (“Quad Display”) until the next quad is available. Note that while the detection algorithm is running, two more images are captured and displayed to the user, but their quads are never computed, since the quad-computing thread is busy.
The major benefit of this approach is that the camera itself runs at its native speed—with no external latency and at 30 fps. Unfortunately, the quad on the screen only updates at 10 fps, and even worse, is offset from the image from which it is computed! That is, by the time the relevant quad has been computed, the corresponding image is no longer on screen. This results in laggy, choppy quads on screen, even though the images themselves are buttery smooth, as shown in the animated GIF below.
Another approach is to serialize the processing and to skip displaying images altogether when we are backed up, as shown in the next diagram. Once the camera captures an image and sends it to our app (“Capture Event”), we can run the algorithm (“Quad Detection”), and when the result is ready, display it on the screen (“Quad Display”) along with the source image (“Image Display”). While the algorithm is busy, additional images that arrive from the camera are dropped.
In contrast to the first approach, the major benefit here is that the quad will always be synced to the imagery being displayed on the screen, as shown in the first animated GIF below.
Unfortunately, the camera now runs at reduced frame rate (10 fps). What’s more disruptive, however, is the large latency (100 ms) between the physical reality and the viewfinder. This is not visible in the GIF alone, but to a user who is looking at both the screen and the physical document, this temporal misalignment will be jarring and is a well-known issue for VR headsets.
The two approaches described thus far have complementary strengths and weaknesses: it seems like you can either get smooth images OR correct quads, but not both. Is that true, though? Perhaps we can get the best of both worlds?
A good rule of thumb in performance is to not do the same thing twice, and this adage applies aptly in video processing. In most cases, camera frames that are adjacent temporally will contain very similar data, and this prior can be exploited as follows: run the full detector on one frame, estimate how the camera has moved by the time the next frame arrives, and then simply transform the previously detected quad accordingly instead of re-detecting it from scratch.
While this is a promising simplification that turns our original detection problem into a tracking problem, robustly computing the transformation between two images is a nontrivial and slow exercise on its own. We experimented with various approaches (brute-forcing, keypoint-based alignment with RANSAC, digest-based alignment), but did not find a satisfactory solution that was fast enough.
In fact, there is an even stronger prior than what we claimed above; the two images we are analyzing are not just any two images! Each of these images, by stipulation, contains a quad, and we already have the quad for the first image. Therefore, it suffices to figure out where in the second image this particular quad ends up. More formally, we try to find the transform of this quad such that the edge response of the hypothetical new quad, defined to be the line integral of the gradient of the image measured perpendicular to the perimeter of the quad, is maximized. This measure optimizes for strong edges across the boundaries of the document.
See the appendix below for a discussion on how to solve this efficiently.
Theoretically, we could now run detection only once and then track from there on out. However, this would cause any error in the tracking algorithm to accumulate over time. So instead, we continue to run the quad detector as before, in a loop—it will now take slightly over 100 ms, given the extra compute we are performing—to provide the latest accurate estimate of the quad, but also perform quad tracking at the same time. The image is held until this (quick) tracking process is done, and is displayed along with the quad on the screen. Refer to the diagram below for details.
In summary, this hybrid processing mode combines the best of both asynchronous and synchronous modes, yielding a smooth viewfinder with quads that are synced to the viewfinder, at the cost of a little bit of latency. The table below compares the three methods:
| | Asynchronous | Synchronous | Hybrid |
| --- | --- | --- | --- |
| Image throughput | 30 Hz | 10 Hz | 30 Hz |
| Image latency | 0 ms | 100 ms | ~30 ms |
| Quad throughput | 10 Hz | 10 Hz | 30 Hz |
| Quad latency | 100 ms | 100 ms | ~30 ms |
| Image vs quad offset | 100 ms | 0 ms | 0 ms |
The GIF below compares the hybrid processing (in blue) and the asynchronous processing (in green) on an iPhone 5. Notice how the quad from the hybrid processing is both correct and fast.
In practice, we observed that the most common camera motions in the viewfinder are panning (movement parallel to the document surface), zooming (movement perpendicular to the document surface), and rolling (rotating on a plane parallel to the document surface.) We rely on the onboard gyroscope to compute the roll of the camera between consecutive frames, which can then be factored out, so the problem is reduced to that of finding a scaled and translated version of a particular quadrilateral.
In order to localize the quadrilateral in the current frame, we need to evaluate the aforementioned objective function on each hypothesis. This involves computing a line integral along the perimeter, which can be quite expensive! However, as shown in the figure below, the edges in all hypotheses can have only one of four possible slopes, defined by the four edges of the previous quad.
Exploiting this pattern, we precompute a sheared running sum across the entire image, for each of the four slopes. The diagram below shows two of the running sum tables, with each color indicating the set of pixel locations that are summed together. (Recall that we sum the gradient perpendicular to the edge, not the pixel values.)
Once we have the four tables, the line integral along the perimeter of any hypothesis can be computed in O(1): for each edge, look up the running sums at the endpoints in the corresponding table, and calculate the difference in order to get the line integral over the edge, and then sum up the differences for four edges to yield the desired response. In this manner, we can evaluate the corresponding hypotheses for all possible translations and a discretized set of scales, and identify the one with the highest response. (This idea is similar to the integral images used in the Viola-Jones face detector.)
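As a simplified illustration of the running-sum idea (axis-aligned only; the real implementation shears the sums along the four quad-edge slopes and accumulates the perpendicular gradient rather than raw pixel values):

```java
// Simplified, axis-aligned version of the running-sum trick: one prefix-sum table per
// row lets us evaluate the edge response along any horizontal segment in O(1).
static float[][] buildRowPrefixSums(float[][] edgeResponse, int width, int height) {
    float[][] prefix = new float[height][width + 1];
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            prefix[y][x + 1] = prefix[y][x] + edgeResponse[y][x];
        }
    }
    return prefix;
}

// Edge response integrated over the horizontal segment [x0, x1) of row y: two lookups.
static float segmentResponse(float[][] prefix, int y, int x0, int x1) {
    return prefix[y][x1] - prefix[y][x0];
}
```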
Try out the Dropbox doc scanner today, and stay tuned for our next blog post.
In this post we will take you behind the scenes of how we built a state-of-the-art Optical Character Recognition (OCR) pipeline for our mobile document scanner. We used computer vision and deep learning advances such as bi-directional Long Short Term Memory (LSTMs), Connectionist Temporal Classification (CTC), convolutional neural nets (CNNs), and more. In addition, we will also dive deep into what it took to actually make our OCR pipeline production-ready at Dropbox scale.
In previous posts we have described how Dropbox’s mobile document scanner works. The document scanner makes it possible to use your mobile phone to take photos and “scan” items like receipts and invoices. Our mobile document scanner only outputs an image — any text in the image is just a set of pixels as far as the computer is concerned, and can’t be copy-pasted, searched for, or any of the other things you can do with text.
Hence the need to apply Optical Character Recognition, or OCR. This process extracts actual text from our doc-scanned image. Once OCR is run, we can then enable the following features for our Dropbox Business users: searching the scanned documents and copy-pasting the recognized text.
After we built the first version of the mobile document scanner, we used a commercial off-the-shelf OCR library, in order to do product validation before diving too deep into creating our own machine learning-based OCR system. This meant integrating the commercial system into our scanning pipeline, offering both features above to our business users to see whether they found enough use for the OCR. Once we confirmed that there was indeed strong user demand for the mobile document scanner and OCR, we decided to build our own in-house OCR system for several reasons.
First, there was a cost consideration: having our own OCR system would save us significant money, as the licensed commercial OCR SDK charged us based on the number of scans. Second, the commercial system was tuned for the traditional OCR world of images from flatbed scanners, whereas our operating scenario was much tougher, because mobile phone photos are far more unconstrained, with crinkled or curved documents, shadows and uneven lighting, blurriness and reflective highlights, and so on. Thus, there might be an opportunity for us to improve recognition accuracy.
In fact, a sea change has taken place in the world of computer vision that gave us a unique opportunity. Traditionally, OCR systems were heavily pipelined, with hand-built and highly-tuned modules taking advantage of every condition they could assume to hold for images captured using a flatbed scanner. For example, one module might find lines of text, then the next module would find words and segment letters, then another module might apply different techniques to each piece of a character to figure out what the character is, and so on. Most methods rely on binarization of the input image as an early stage, which is brittle and discards important cues. The process of building these OCR systems was very specialized and labor intensive, and the systems could generally only work with fairly constrained imagery from flatbed scanners.
The last few years have seen the successful application of deep learning to a huge number of problems in computer vision, which has given us powerful new tools for tackling OCR without having to replicate the complex processing pipelines of the past, relying instead on large quantities of data to have the system automatically learn how to do many of the previously manually-designed steps.
Perhaps the most important reason for building our own system is that it would give us more control over our own destiny and allow us to work on more innovative features in the long run.
In the rest of this blog post, we will take you behind the scenes of how we built this pipeline at Dropbox scale. Most commercial machine learning projects follow three major steps:
We will take you through each of these steps in turn.
Our initial task was to see whether we could even build a state-of-the-art OCR system at all.
We began by collecting a representative set of donated document images that match what users might upload, such as receipts, invoices, letters, etc. To gather this set, we asked a small percentage of users whether they would donate some of their image files for us to improve our algorithms. At Dropbox, we take user privacy very seriously and thus made it clear that this was completely optional, and that, if donated, the files would be kept private and secure. We use a wide variety of safety precautions with such user-donated data, including never keeping donated data on local machines in permanent storage, maintaining extensive auditing, requiring strong authentication to access any of it, and more.
Another important, machine learning-specific requirement for user-donated data is how to label it. Most current machine learning techniques are strongly supervised, which means that they require explicit manual labeling of input data so that the algorithms can learn to make predictions themselves. Traditionally, this labeling is done by outside workers, often using a micro-work platform such as Amazon’s Mechanical Turk (MTurk). However, a downside to using MTurk is that each item might be seen and labeled by a different worker, and we certainly don’t want to expose user-donated data in the wild like this!
Thus, our team at Dropbox created our own platform for data annotation, named DropTurk. DropTurk can submit labeling jobs either to MTurk (if we are dealing with public non-user data) or to a small pool of hired contractors for user-donated data. These contractors are under a strict non-disclosure agreement (NDA) to ensure that they cannot keep or share any of the data they label. DropTurk contains a standard list of annotation task UI templates that we can rapidly assemble and customize for new datasets and labeling tasks, which enables us to annotate our datasets quite quickly.
For example, here is a DropTurk UI intended to generate ground truth data for individual word images, including one of the following options for the workers to complete:
DropTurk UI for adding ground truth data for word images
Our DropTurk platform includes dashboards that give an overview of past jobs, show the progress of current jobs, and provide secure access to the results. In addition, we can get analytics to assess workers’ performance, even getting worker-level graphical monitoring of annotations on ongoing jobs to catch potential issues early on:
DropTurk Dashboard
Using DropTurk, we collected both a word-level dataset, which has images of individual words and their annotated text, and a full document-level dataset, which has images of full documents (like receipts) and fully transcribed text. We used the latter to measure the accuracy of existing state-of-the-art OCR systems; this would then inform our efforts by telling us the bar we would have to meet or beat with our own system. On this particular dataset, the accuracy we needed to achieve was in the mid-90s (percent).
Our first job used to be to resolve if the OCR reveal used to be even going to be solvable in an cheap amount of time. So we broke the OCR reveal into two pieces. First, we may perhaps well employ laptop imaginative and prescient to opt an image of a epic and section it into lines and words; we name that the Notice Detector. Then, we may perhaps well make a selection every be aware and feed it into a deep net to flip the be aware image into trusty textual sing; we name that the Notice Deep Catch.
We felt that the Notice Detector will be relatively easy, and so focused our efforts first on the Notice Deep Catch, which we had been less obvious about.
The Notice Deep Catch combines neural network architectures outmoded in laptop imaginative and prescient and computerized speech recognition systems. Photography of cropped words are fed into a Convolutional Neural Catch (CNN) with rather a lot of convolutional layers. The visible capabilities which are output by the CNN are then fed as a series to a Bidirectional LSTM (Lengthy Fast Timeframe Memory) — same outdated in speech recognition systems — which fabricate sense of our be aware “pieces,” and lastly arrives at a textual sing prediction the utilization of a Connectionist Temporal Classification (CTC) layer. Batch Normalization is outmoded where acceptable.
OCR Notice Deep Catch
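To make the architecture concrete, here is a minimal sketch of a CNN + BiLSTM network with per-timestep character logits suitable for CTC training, written with TensorFlow/Keras. The input size, layer widths, and character vocabulary are illustrative assumptions, not the actual Word Deep Net configuration.
import tensorflow as tf
from tensorflow.keras import layers
NUM_CLASSES = 96          # assumed character vocabulary size (plus CTC blank)
IMG_H, IMG_W = 32, 256    # assumed fixed-size word-image input
inputs = tf.keras.Input(shape=(IMG_H, IMG_W, 1), name="word_image")
x = inputs
# Convolutional feature extractor with batch normalization.
for filters in (64, 128, 256):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
# Make width the leading axis, then collapse height/channels so width becomes a time sequence.
x = layers.Permute((2, 1, 3))(x)
x = layers.Reshape((IMG_W // 8, (IMG_H // 8) * 256))(x)
# Bidirectional LSTM over the horizontal feature sequence.
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
# Per-timestep character logits; CTC loss/decoding (e.g., tf.nn.ctc_loss) is applied on top.
logits = layers.Dense(NUM_CLASSES, name="char_logits")(x)
model = tf.keras.Model(inputs, logits)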
Once we had settled on this network architecture for turning an image of a single word into text, we then needed to figure out how to collect enough data to train it. Deep learning systems generally need large amounts of training data to achieve good recognition performance; in fact, the amount of training data is often the most important bottleneck in current systems. Normally, all this data has to be collected and then labeled manually, a time-consuming and expensive process.
An alternative is to programmatically generate training data. However, in most computer vision problems it's currently too difficult to generate realistic-enough images for training algorithms: the variety of imaging environments and transformations is too varied to effectively simulate. (One promising area of current research is Generative Adversarial Networks (GANs), which seem to be well-suited to generating realistic data.) Fortunately, our problem in this case is a great match for synthetic data, since the kinds of images we need to generate are quite constrained and can thus be rendered automatically. Unlike photos of natural or most manmade objects, documents and their text are synthetic, and the variability of individual characters is relatively limited.
Our synthetic data pipeline consists of three pieces: a corpus of words to draw from, a collection of fonts for rendering the words, and a set of geometric and photometric transformations meant to simulate real-world distortions.
The generation algorithm simply samples from each of these to create a unique training example.
Synthetically generated word images
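As an illustration of that sample-and-render loop, here is a minimal sketch of a synthetic word-image generator using Pillow. The word list, font file names, and the small set of distortions are placeholder assumptions; the real pipeline draws from a much larger corpus, font collection, and transformation set.
import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter
WORDS = ["receipt", "total", "Dropbox", "$12.99"]     # assumed corpus sample
FONTS = ["DejaVuSans.ttf", "DejaVuSerif.ttf"]         # assumed font files on disk
def render_word(word, font_path):
    font = ImageFont.truetype(font_path, size=32)
    # Measure the text, then render it on a white canvas with some padding.
    left, top, right, bottom = font.getbbox(word)
    img = Image.new("L", (right - left + 20, bottom - top + 20), color=255)
    ImageDraw.Draw(img).text((10 - left, 10 - top), word, font=font, fill=0)
    return img
def distort(img):
    # Apply one randomly chosen geometric/photometric transformation.
    choice = random.choice(["rotate", "blur", "none"])
    if choice == "rotate":
        return img.rotate(random.uniform(-3, 3), expand=True, fillcolor=255)
    if choice == "blur":
        return img.filter(ImageFilter.GaussianBlur(radius=1))
    return img
def sample_example():
    word = random.choice(WORDS)
    img = distort(render_word(word, random.choice(FONTS)))
    return img, word   # the image plus its ground-truth label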
We started simply with all three, with words coming from a collection of Project Gutenberg books from the 19th century, about a thousand fonts we had collected, and some simple distortions like rotations, underlines, and blurs. We generated about a million synthetic words, trained our deep net, and then tested our accuracy, which was around 79%. That was okay, but not good enough.
Through many iterations, we evolved each piece of our synthetic data pipeline in various ways to improve the recognition accuracy. Some highlights:
Receipts performed poorly, so we added words from the Uniform Product Code (UPC) database, such as "24QT TISSUE PAPER", to our corpus.
Our network could not handle disconnected characters well. Receipts printed with thermal fonts often have speckled, broken, or smudged characters, while the fonts we had trained on were either fully connected (as from laser printers) or lightly bit-mapped (as in screenshots). To fix this, we tracked down a representative set of old thermal printer fonts.
Synthetically generated words using different thermal printer fonts, common in receipts
We hand-curated our roughly 2,000 fonts, sampling the 50 most commonly used fonts more often according to their real-world frequency while still including rarer fonts, and removed fonts whose symbols rendered as boxes or had incorrect casing.
Because the Word Detector is tuned for high recall and low precision, the deep net has to cope with crops that are just noise, so we added negative training examples with common textured backgrounds such as wood and marble countertops.
Synthetically generated negative training examples
Histograms of our synthetic words showed that some symbols, such as / and &, were underrepresented, so we manually boosted their frequency, for example by synthesizing dates, prices, URLs, etc.
We added many more transformations, such as warping, fake shadows, and fake creases.
Fake shadow effect
Data is as important as the machine learning model used, so we spent a great deal of time refining this data generation pipeline. At some point we may open source and release this synthetically generated data for others to train and validate their own systems and research on.
We trained our network on Amazon EC2 G2 GPU instances, spinning up many experiments in parallel. All of our experiments went into a lab notebook that included everything needed to reproduce them, so that we could track unexpected accuracy bumps or losses.
Our lab notebook contained numbered experiments, with the most recent experiment first. It tracked everything needed for machine learning reproducibility, such as a unique git hash for the code that was used, pointers to S3 with the generated data sets and results, evaluation results, graphs, a high-level description of the goal of that experiment, and more. As we built our synthetic data pipeline and trained our network, we also built many special-purpose tools to visualize fonts, debug network guesses, and so on.
Example of early experiment tracking: error rate vs. how long our Word Deep Net had trained, against an evaluation dataset that consisted of just single words (Single Word Accuracy)
Our early experiments tracked how well the Word Deep Net did at OCRing images of single words, which we called Single Word Accuracy (SWA). Accuracy in this context meant how many of the ground truth words the deep net got right. In addition, we tracked precision and recall for the network. Precision refers to the fraction of words returned by the deep net that were actually correct, while recall refers to the fraction of the evaluation data that was correctly predicted by the deep net. There tends to be a tradeoff between precision and recall.
For example, imagine we have a machine learning model designed to classify an email as spam or not. Precision asks: of all the items the classifier labeled as spam, how many were actually spam? Recall, in contrast, asks: of all the items that truly are spam, how many did we label? It is possible to correctly label spam emails (high precision) while not labeling all of the actual spam emails (low recall).
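A minimal sketch of the bookkeeping behind these two metrics, with made-up counts purely for illustration:
def precision_recall(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall
# Example: 90 correctly OCRed words, 10 spurious predictions, 30 missed words.
p, r = precision_recall(90, 10, 30)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.75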
Week over week, we tracked how well we were doing. We divided our dataset into different categories, such as register_tapes (receipts), screenshots, scanned_docs, etc., and computed accuracies both individually for each category and overall across all data. For example, the entry below shows early work in our lab notebook for our first full end-to-end test, with a real Word Detector coupled to our real Word Deep Net. You can see that we did quite terribly at the start:
Screenshot from early end-to-end experiments in our lab notebook
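The per-category bookkeeping described above can be sketched in a few lines of Python; the category names mirror the ones mentioned in the post, while the evaluation records and the exact-match accuracy definition are invented for illustration.
from collections import defaultdict
# Each record: (category, ground_truth_text, predicted_text)
records = [
    ("register_tapes", "TOTAL 12.99", "TOTAL 12.99"),
    ("screenshots", "Sign in to continue", "Sign in to continue"),
    ("scanned_docs", "Dear Ms. Smith", "Dear Ms Smith"),
]
correct = defaultdict(int)
total = defaultdict(int)
for category, truth, prediction in records:
    total[category] += 1
    correct[category] += int(truth == prediction)
for category in total:
    print(f"{category}: {correct[category] / total[category]:.2%}")
print(f"overall: {sum(correct.values()) / sum(total.values()):.2%}")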
At a certain point our synthetic data pipeline was resulting in a Single Word Accuracy (SWA) percentage in the high-80s on our OCR benchmark set, and we decided we were done with that portion. We then collected about 20,000 real images of words (compared to our 1 million synthetically generated words) and used these to fine tune the Word Deep Net. This took us to an SWA in the mid-90s.
We now had a system that could do very well on individual word images, but of course a real OCR system operates on images of entire documents. Our next step was to focus on the document-level Word Detector.
For our Word Detector we decided not to use a deep net-based approach. The primary candidates for such approaches were object detection systems, like RCNN, that try to detect the locations (bounding boxes) of objects like dogs, cats, or plants in images. Most images only have perhaps one to five instances of a given object.
However, most documents don't just have a handful of words; they have hundreds or even thousands of them, a few orders of magnitude more objects than most neural network-based object detection systems were capable of finding at the time. We were thus not sure that such algorithms would scale up to the level our OCR system needed.
Another important consideration was that traditional computer vision approaches using feature detectors might be easier to debug, as neural networks are notoriously opaque and have internal representations that are hard to understand and interpret.
We ended up using a classic computer vision approach named Maximally Stable Extremal Regions (MSERs), using OpenCV's implementation. The MSER algorithm finds connected regions at different thresholds, or levels, of the image. Essentially, it detects blobs in images, and is thus particularly good for text.
Our Word Detector first detects MSER features in an image, then strings these together into word and line detections. One tricky part is that our Word Deep Net accepts fixed-size word image inputs. This requires the Word Detector to sometimes include more than one word in a single detection box, or to chop a single word in half if it is too long to fit the deep net's input size. Information about this cutting then has to be propagated through the entire pipeline, so that we can re-assemble the text after the deep net has run. Another bit of trickiness is dealing with images with white text on dark backgrounds, as opposed to dark text on white backgrounds, forcing our MSER detector to handle both scenarios.
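For reference, here is a minimal sketch of MSER blob detection with OpenCV's Python bindings; the production detector wraps a customized C++ implementation, and the grouping of blobs into words and lines is not shown. The input and output file names are placeholders.
import cv2
img = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(img)  # candidate blobs, roughly character-sized
# Candidate boxes would then be grouped into word and line detections by the
# surrounding pipeline; here we just draw them for debugging.
for (x, y, w, h) in bboxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), color=128, thickness=1)
cv2.imwrite("mser_debug.png", img)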
Once we had refined our Word Detector to an acceptable point, we chained it together with our Word Deep Net so that we could benchmark the entire combined system end-to-end against document-level images, rather than our older Single Word Accuracy benchmarking suite. However, when we first measured the end-to-end accuracy, we found that we were performing at around 44%, quite a bit worse than the competition.
The primary issues were spacing and spurious garbage text from noise in the image. Sometimes we would incorrectly combine two words, such as "helloworld", or incorrectly fragment a single word, such as "wo rld".
Our solution was to modify the Connectionist Temporal Classification (CTC) layer of the network to also give us a confidence score in addition to the predicted text. We then used this confidence score to bucket predictions in three ways:
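The specific bucketing rules are not spelled out here, so the sketch below simply assumes the three buckets are: keep high-confidence predictions, drop low-confidence ones as likely noise, and double-check medium-confidence ones against a lexicon. The thresholds and the lexicon check are illustrative assumptions only.
HIGH, LOW = 0.90, 0.40  # assumed confidence thresholds
def bucket_prediction(text, confidence, lexicon):
    if confidence >= HIGH:
        return text          # trust the prediction as-is
    if confidence < LOW:
        return None          # likely noise; drop it
    # Medium confidence: keep the prediction only if a lexicon check supports it.
    return text if text.lower() in lexicon else None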
We also had to deal with issues caused by the previously mentioned fixed receptive image size of the Word Deep Net: namely, that a single "word" window might actually contain multiple words, or only part of a very long word. We therefore run these outputs, along with the original outputs from the Word Detector, through a module we call the Wordinator, which gives discrete bounding boxes for each individual OCRed word. This results in individual word coordinates along with their OCRed text.
For example, in the following debug visualization from our system, you can see boxes around detected words before the Wordinator:
The Wordinator will break some of these boxes into individual word coordinate boxes, such as "of" and "Engineering", which are currently part of the same box.
Finally, now that we had a fully working end-to-end system, we generated more than ten million synthetic words and trained our neural net for a very large number of iterations to squeeze out as much accuracy as we could. All of this finally gave us accuracy, precision, and recall numbers that all met or exceeded the OCR state of the art.
We briefly patted ourselves on the back, then began to prepare for the next tricky stage: productionization.
At this point, we had a collection of prototype Python and Lua scripts wrapping Torch (and a trained model, of course!) that showed we could achieve state-of-the-art OCR accuracy. However, this is a long way from a system an actual user can rely on in a distributed setting with reliability, performance, and solid engineering. We needed to create a distributed pipeline suitable for use by millions of users, and a system to replace our prototype scripts. In addition, we had to do this without disrupting the existing OCR system that used the commercial off-the-shelf SDK.
Here's a diagram of the productionized OCR pipeline:
Overall Productionized OCR Pipeline
We started by creating an abstraction for different OCR engines, including our own engine and the commercial one, and gated this using our in-house experiments framework, Stormcrow. This allowed us to introduce the skeleton of our new pipeline without disrupting the existing OCR system, which was already running in production for millions of our Business customers.
We also ported our Torch-based model, including the CTC layer, to TensorFlow for a few reasons. First, we had already standardized on TensorFlow in production to make it easier to manage models and deployments. Second, we prefer to work with Python rather than Lua, and TensorFlow has excellent Python bindings.
In the new pipeline, mobile clients upload scanned document images to our in-house asynchronous work queue. When the upload is finished, we send the image via a Remote Procedure Call (RPC) to a cluster of servers running the OCR service.
The actual OCR service uses OpenCV and TensorFlow, both written in C++ and with complicated library dependencies, so security exploits are a real concern. We have isolated the actual OCR portion into jails using technologies like LXC, CGroups, Linux Namespaces, and Seccomp to provide isolation and syscall whitelisting, using IPC to talk into and out of the isolated container. If someone compromises the jail, they will still be completely separated from the rest of our system.
Our jail infrastructure allows us to efficiently set up expensive resources a single time at startup, such as loading our trained models, and then have these resources cloned into a jail to satisfy a single OCR request. The resources are cloned copy-on-write into the forked jail and are read-only for the way we use our models, so this is quite efficient and fast. We had to patch TensorFlow to make it easier to do this kind of forking. (We submitted the patch upstream.)
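The copy-on-write idea can be illustrated with a plain os.fork sketch: load the expensive resources once in the parent, then fork a short-lived worker per request so the loaded pages are shared read-only. This is a simplified, hypothetical illustration (Unix-only); none of the jailing (LXC, namespaces, seccomp) or the TensorFlow patch is shown.
import os
def load_model():
    # Stand-in for loading trained model resources once, in the parent process.
    return {"weights": [0.0] * 1_000_000}
def handle_request(model, request_id):
    # Child process: read-only use of the parent's model pages (shared via COW).
    print(f"worker {os.getpid()} handling request {request_id}")
def serve(model, requests):
    for request_id in requests:
        pid = os.fork()
        if pid == 0:                   # child
            handle_request(model, request_id)
            os._exit(0)
        os.waitpid(pid, 0)             # parent waits (kept sequential for clarity)
if __name__ == "__main__":
    serve(load_model(), requests=range(3))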
Once we have word bounding boxes and their OCRed text, we merge them back into the original PDF produced by the mobile document scanner as an OCR hidden layer. The user thus gets a PDF that has both the scanned image and the detected text. The OCRed text is also added to Dropbox's search index. The user can now highlight and copy-paste text from the PDF, with the highlights landing in the correct place thanks to our hidden word box coordinates. They can also search for the scanned PDF via its OCRed text on Dropbox.
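As a rough illustration of a "hidden" OCR text layer, here is a sketch that overlays invisible, selectable text on a page image using the reportlab library (text render mode 3 draws nothing visible). The page size, image path, and word boxes are placeholder assumptions; the actual pipeline merges the text into the PDF already produced by the document scanner.
from reportlab.pdfgen import canvas
PAGE_W, PAGE_H = 612, 792   # US Letter in points (assumed)
words = [("Dropbox", 72, 700, 12), ("Engineering", 130, 700, 12)]  # (text, x, y, font size)
c = canvas.Canvas("scan_with_ocr.pdf", pagesize=(PAGE_W, PAGE_H))
c.drawImage("scanned_page.png", 0, 0, width=PAGE_W, height=PAGE_H)  # placeholder image
text_layer = c.beginText()
text_layer.setTextRenderMode(3)          # render mode 3 = invisible text
for word, x, y, size in words:
    text_layer.setFont("Helvetica", size)
    text_layer.setTextOrigin(x, y)
    text_layer.textOut(word)
c.drawText(text_layer)
c.showPage()
c.save()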
At this point, we had an actual engineering pipeline (with unit tests and continuous integration!), but we still had performance issues.
The first question was whether we would use CPUs or GPUs in production at inference time. Training a deep net takes much longer than using it at inference time. It is common to use GPUs during training (as we did), as they significantly reduce the amount of time it takes to train a deep net. However, using GPUs at inference time is currently a harder call to make.
First, having high-end GPUs in a production data center such as Dropbox's is still somewhat unusual and different from the rest of the fleet. In addition, GPU-based machines are more expensive, and their configurations churn faster due to rapid development. We did a detailed analysis of how our Word Detector and Word Deep Net performed on CPUs vs GPUs, assuming full use of all cores on each CPU and the characteristics of the CPU. After much analysis, we decided that we could hit our performance targets on just CPUs at similar or lower cost than with GPU machines.
Once we decided on CPUs, we then needed to optimize our system BLAS libraries for the Word Deep Net, to tune our network a bit, and to configure TensorFlow to use the available cores. Our Word Detector was also a significant bottleneck. We ended up essentially rewriting OpenCV's C++ MSER implementation in a more modular way to avoid duplicating slow work when doing two passes (to handle both black-on-white and white-on-black text), to expose more to our Python layer (the underlying MSER tree hierarchy) for more efficient processing, and to make the code actually readable. We also had to optimize the post-MSER Word Detection pipeline to tune and vectorize certain slow parts of it.
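For the "use the available cores" piece, a minimal sketch with TensorFlow's threading configuration looks like the following; the API shown is the TensorFlow 2.x form and the thread counts are placeholder choices, not the settings used in production.
import os
import tensorflow as tf
cores = os.cpu_count() or 1
tf.config.threading.set_intra_op_parallelism_threads(cores)  # threads within a single op
tf.config.threading.set_inter_op_parallelism_threads(2)      # ops that may run in parallel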
After all this work, we had a productionized and highly performant system that we could "shadow turn on" for a small number of users, leading us to the third phase: refinement.
With our proposed system running silently in production side-by-side with the commercial OCR system, we needed to confirm that our system was truly better, as measured on real user data. We take user data privacy very seriously at Dropbox, so we couldn't just look at and test random mobile document scanned images. Instead, we used the user-image donation flow detailed earlier to get evaluation images. We then used these donated images, being very careful about their privacy, to do a qualitative blackbox test of both OCR systems end-to-end, and were pleased to find that we indeed performed the same as or better than the older commercial OCR SDK, allowing us to ramp up our system to 100% of Dropbox Business users.
Next, we tested whether fine-tuning our trained deep net on these donated documents, versus our hand-selected fine-tuning image suite, helped accuracy. Unfortunately, it didn't move the needle.
Another important refinement was doing orientation detection, which we had not done in the original pipeline. Images from the mobile document scanner can be rotated by 90° or even upside down. We built an orientation predictor using another deep net based on the Inception ResNet v2 architecture, changed the final layer to predict orientation, collected an orientation training and validation data set, and fine-tuned from an ImageNet-trained model biased toward our own needs. We put this orientation predictor into our pipeline, using its detected orientation to rotate the image to upright before doing word detection and OCRing.
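A minimal sketch of such an orientation head, built by replacing the classification layer of an ImageNet-pretrained Inception ResNet v2 with a four-way output (0°, 90°, 180°, 270°) in Keras; the input size and four-class setup are assumptions for illustration, not the production configuration.
import tensorflow as tf
from tensorflow.keras import layers
base = tf.keras.applications.InceptionResNetV2(
    include_top=False, weights="imagenet", input_shape=(299, 299, 3))
x = layers.GlobalAveragePooling2D()(base.output)
orientation_logits = layers.Dense(4, name="orientation")(x)  # 0 / 90 / 180 / 270 degrees
model = tf.keras.Model(base.input, orientation_logits)
# Fine-tune from the ImageNet weights on an orientation-labeled dataset:
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))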
One tricky part of the orientation predictor was that only a small percentage of images are actually rotated; we needed to make sure our system didn't inadvertently rotate upright images (the most common case) while trying to fix the orientation of the smaller number of non-upright images. In addition, we had to solve various tricky issues in combining our upright rotated images with the different ways the PDF file format can apply its own transformation matrices for rotation.
Finally, we were surprised to find some tricky issues with the PDF file format containing our scanned OCRed hidden layer in Apple's native Preview application. Most PDF renderers respect spaces embedded in the text for copy and paste, but Apple's Preview application performs its own heuristics to determine word boundaries based on text position. This resulted in unacceptable quality for copy and paste from this PDF renderer, causing most spaces to be dropped and all the words to be "glommed together." We had to do extensive testing across a wide range of PDF renderers to find the right PDF tricks and workarounds that would solve this problem.
In all, this entire round of research, productionization, and refinement took about eight months, at the end of which we had built and deployed a state-of-the-art OCR pipeline to millions of users using modern computer vision and deep neural network techniques. Our work also provides a solid foundation for future OCR-based products at Dropbox.
Interested in applying the latest machine learning research to hard, real-world problems and shipping to millions of Dropbox users? Our team is hiring!
Brad Neuberg April 12, 2017
In this post, we'll describe how we built a world-class OCR system for our mobile document scanner. We used computer vision and deep learning techniques such as LSTMs, CTC, and CNNs, and we'll also cover what it actually takes to run an OCR system in production at Dropbox scale.
In a previous post we described how Dropbox's mobile document scanner works. The document scanner lets you photograph and scan receipts, invoices, and other documents with your smartphone. But the scanner only outputs an image; to a computer, any text in that image is just pixels and cannot be copied and pasted, searched, or otherwise processed as text.
Hence the need for OCR, which extracts the actual text from our scanned document images. Once OCR is running, Dropbox Business users get the following features:
Extract and index all the text in scanned documents so it can be searched later
Save scans as PDFs from which text can be copied and pasted
When we built the first version of the mobile document scanner, we used a commercial OCR library for product validation before building our own machine learning-based OCR system. That meant integrating the commercial software into our scanning pipeline and offering the two features above to our Business users to see whether OCR was actually useful to them. Once we confirmed that there was indeed strong user demand for the mobile scanner and OCR, we decided to build our own OCR system, for the following reasons.
First, cost: since the commercial OCR SDK charged per scan, an in-house OCR system would save us a significant amount of money. Second, the commercial system was tuned for the traditional images produced by flatbed scanners, while our scenario is much harder: smartphone photos are far less constrained, with curled or crumpled pages, shadows and uneven lighting, blur, and reflections. So we needed to improve recognition accuracy.
In fact, a sea change in computer vision gave us our opening. OCR systems have traditionally been built from many hand-engineered and hand-tuned modules, under the assumption of well-behaved images from flatbed scanners. For example, one module detects lines of text, the next detects and segments individual words within each line, and another recognizes individual characters or words using several different techniques. Most of these approaches also rely on binarizing the input image early on, and binarization is brittle and can throw away important details. Building such OCR systems is highly specialized and labor-intensive, and they can only handle the limited range of images that flatbed scanners produce.
Over the past few decades, deep learning has been applied successfully across computer vision, giving us many new tools for building OCR without repeating that laborious manual process. Instead, we need large amounts of data so the system can automatically learn the features that previously had to be hand-designed.
Perhaps most importantly, building our own system gives us more control and lets us pursue more innovative ideas in the future.
Most commercial machine learning projects involve three steps:
Research and prototyping to see whether the idea is feasible
Productionizing the model from step 1 for actual end users
Refining the system once it is live
We'll walk through these steps in order.
========================
Our initial goal was to see whether we could build a state-of-the-art OCR system at all.
We first needed to collect a representative set of document images, mostly the receipts, invoices, and letters users are likely to upload. To gather this set, we asked a small percentage of users whether they would donate some of their document images for us to test our algorithms on. Dropbox takes user privacy very seriously, so donating was entirely voluntary, and donated files were handled securely and privately: for example, donated data was never stored on local machines, access was audit-logged and strictly limited, and so on.
Another machine learning-specific concern is data labeling. Most machine learning techniques today are supervised, meaning they require explicit, manually labeled data so the algorithms can learn to make predictions. Labeling is usually outsourced, for example to Amazon's Mechanical Turk (MTurk). The downside of such a platform is that the annotators can see the data, and we did not want annotators to see user-donated data.
So we built our own annotation platform for this donated data, named DropTurk. It can submit labeling jobs either to MTurk or to a small pool of contractors we hired specifically to label user-donated data. We signed very strict NDAs with these annotators to ensure they cannot and will not keep or share any of the data they label. DropTurk provides a set of annotation task UI templates that we can quickly combine for different datasets and labeling tasks, making annotation very fast.
For example, in the UI shown below, for each word image the annotator has to:
Transcribe the actual text in the image
Mark whether the word's orientation is incorrect
Mark whether the text is not English
Mark whether it is unreadable or contains no text
DropTurk UI for adding ground truth data for word images
DropTurk includes dashboards where you can review past labeling jobs, track the progress of current jobs, and securely access the results. We can also evaluate annotator efficiency, and even monitor the tasks an individual annotator is currently working on, so potential issues can be caught in time:
DropTurk Dashboard
Using DropTurk, we collected a word-level dataset containing images of individual words with their annotated text, as well as a document-level dataset containing images of documents with their full transcribed text. We used the document dataset to evaluate existing state-of-the-art OCR systems, so that when building our own we would know what target to set. On this particular dataset, the accuracy percentage we had to achieve was in the mid-90s.
The first task was to confirm whether the OCR problem could be solved in a reasonable amount of time. We split it into two parts. The first uses computer vision techniques to segment the document image into lines of text and individual words; this is the Word Detector. The other feeds each word image into a deep network to get the actual text; we call this the Word Deep Net.
The Word Deep Net combines neural network architectures used in computer vision and speech recognition systems. The segmented word images are fed into a Convolutional Neural Network (CNN) with several convolutional layers. The features extracted by the CNN are then fed as a sequence to a Bidirectional LSTM (Long Short Term Memory), and finally a Connectionist Temporal Classification (CTC) layer makes the text prediction.
Once we chose this network architecture for turning word images into text, we needed to work out how to obtain the data to train it. Deep learning systems require huge amounts of data to reach good recognition performance; in fact, the scale of the training data is the main bottleneck in such systems today. Normally all of this data has to be collected and labeled by hand, which is time-consuming, labor-intensive, and expensive.
The alternative is to synthesize training data. For most computer vision problems, however, it is very hard to synthesize training data that comes close enough to real data to train an algorithm on; effectively imitating the diversity and variability of real-world imaging environments is difficult. But our OCR scenario is well suited to synthetic training data, because unlike natural objects or most man-made objects, the document images we need to generate are limited in scope, and the variety of words is also quite limited.
Our synthetic data pipeline has three parts:
A corpus of words
A collection of fonts
Geometric and photometric transformations that simulate the real world
The generation algorithm randomly samples from each of these to produce a training example.
The word corpus came from e-books on Project Gutenberg; we collected roughly 1,000 fonts and added distortions such as rotation, underlines, and blur. We generated about a million word images and trained a model, whose accuracy on testing came out at roughly 79%: not great, but acceptable.
Through many iterations, we improved every part of the synthetic data pipeline to raise recognition accuracy. For example:
We noticed that receipts performed poorly, so we added words from the Uniform Product Code (UPC) database, such as "24QT TISSUE PAPER", to the corpus.
We noticed that our network could not handle disconnected characters well. Receipts printed with thermal fonts tend to have speckled, broken, or ink-smudged characters, while the characters the network saw during training were either fully connected (as from laser printers) or lightly bit-mapped (as in screenshots). To address this, we found a vendor willing to supply a representative set of old thermal printer fonts.
Synthetically generated word images using thermal printer fonts commonly found on receipts
Our initial font selection was too naive. In the end we hand-picked about 2,000 fonts, and we did not treat them all equally: we researched the 50 most commonly used fonts and recorded their usage frequencies, so common fonts were sampled more often while rarer fonts were still included. In addition, some fonts rendered certain symbols incorrectly as boxes or had the wrong casing, so we had to manually review all 2,000 fonts to make sure we were not feeding wrong symbols, digits, or boxes to our network.
The Word Detector is tuned for high recall and low precision so that it never misses anything that might be text. That means the recognition network has to handle many blank crops that are just noise, so we added negative examples to the training data, such as common textured backgrounds like wood and marble countertops.
Synthetically generated negative training examples
Histograms of the synthetic words showed that some symbols, such as / or &, were badly underrepresented, so we manually adjusted the frequency of these characters, for example by synthesizing dates, prices, URLs, etc.
We added many more transformations, such as warping, fake shadows, fake creases, and much more.
Fake shadow effect
Data is as important to the final model as the model itself, so we spent a lot of time refining the data generation process. We may later open source this part so others can use it to train and test their own systems.
We trained our network models on Amazon EC2 G2 GPU instances, running many experiments in parallel. All experiments were tracked in a lab notebook containing everything needed to reproduce them: the git hash controlling which code was used, pointers to the synthetic datasets, results, evaluation output, and graphs stored in S3, and an overview of each experiment's goal. While synthesizing data and training models we also built a number of tools, such as tools for visualizing fonts and for debugging network predictions.
Example early experiment tracking error rate vs. how long our Word Deep Net had trained, against an evaluation dataset that consisted of just single words (Single Word Accuracy)
In early experiments we recorded how well the Word Deep Net handled single word images, which we called Single Word Accuracy (SWA): how many of the words the network recognized correctly. We also recorded the network's precision and recall.
Note:
Precision is the fraction of the samples the system judges to be positive that really are positive, i.e. TP / (TP + FP). Recall is the fraction of all truly positive samples that are judged positive, i.e. TP / (TP + FN). The False Positive Rate (FPR), also called the probability of false alarm, is the fraction of all truly negative samples that are incorrectly judged positive, i.e. FP / (FP + TN).
Every week we tracked our actual progress. We split the dataset into several categories, such as register_tapes (receipts), screenshots, scanned_docs, etc., and computed the accuracy for each category as well as across all the data. The figure below shows an early run of our first Word Detector coupled with the Word Deep Net; the results were still pretty bad.
Screenshot from early end-to-end experiments in our lab notebook
At a certain point our synthetic data pipeline was yielding a Single Word Accuracy in the high-80s on our OCR benchmark set, and we decided that part of the work was good enough. We then collected about 20,000 real word images (compared with the roughly one million synthetic word images used earlier) and used them to fine-tune the Word Deep Net. That brought SWA to around the mid-90s.
We now had a system that could handle single word images, but a real OCR system has to process images of entire documents. The next step was the document-level Word Detector.
Here we chose a non-deep-learning approach. The main deep learning candidates were object detection systems such as RCNN, which are used to detect and localize objects like cats and dogs; most images contain no more than about five instances of a given object.
But most documents contain not a handful of words but hundreds or thousands, a number far beyond what most neural network-based object detection systems could handle. We could not be sure such algorithms would work for our OCR system.
Another consideration was that traditional computer vision methods based on feature detectors are easier to debug, whereas neural network approaches are hard to understand and interpret.
In the end we chose the classic computer vision algorithm Maximally Stable Extremal Regions (MSERs), using OpenCV's implementation. MSER finds connected regions of an image at different weights/thresholds; in other words, it detects blobs in the image, which makes it especially well suited to text.
Note: further reading on MSER is available elsewhere.
Our Word Detector first detects MSER features in an image, then strings them together into word and line detections. One thing to note is that the Word Deep Net only accepts fixed-size word image inputs, so the Word Detector sometimes has to include multiple words in one detection box, or cut a word in two when it is too long. The cutting information has to be carried through the whole pipeline so that the final results can be reassembled. Another point is that the MSER detector has to handle both light text on dark backgrounds and dark text on light backgrounds.
Once the Word Detector was tuned well enough, we combined it with the Word Deep Net and tested on whole document images rather than individual word images. The first end-to-end run came out at only 44%, which was very poor.
The main causes were spaces and garbage text produced by noise in the image. Sometimes two words would be merged into one, like "helloworld", or a single word would be split incorrectly, like "wo rld".
The solution was to modify the CTC layer of the network so that it also outputs a confidence score for the predicted text, and then to handle each prediction according to its confidence:
Another point to handle was the fixed-size input of the Word Deep Net mentioned earlier: a single word window may actually contain several words, or only part of a very long word. So we used a module called the Wordinator, which outputs a concrete bounding box (top/bottom/left/right coordinates) for each recognized word.
For example, in the debug output below, you can see the word bounding boxes detected before the Wordinator is applied:
The Wordinator can split some of these bounding boxes into single-word bounding boxes, for example "of" and "Engineering".
Finally, with the whole system working, we generated ten million synthetic words and trained our model for as many iterations as we could to get the best possible results. In the end, our accuracy, precision, and recall all reached state-of-the-art levels.
At that point we had a prototype built from Python- and Lua-wrapped Torch whose accuracy was already state of the art, but there was still a long way to go before it was a reliable, performant, well-engineered system that real users could use. We needed to build a system that could serve millions of users, and we also had to avoid breaking the existing system built on the commercial OCR SDK.
The figure below shows the processing flow of the whole productionized OCR system:
Overall Productionized OCR Pipeline
We created a layer of abstraction over the different OCR engines (our own and the commercial one), gated with our in-house Stormcrow framework. That let us introduce the new pipeline without having to modify the existing online OCR system already serving millions of users.
We also migrated the Torch-based model to TensorFlow, for a few reasons: first, we had already standardized on TensorFlow in production, which makes model management and deployment easier; second, we prefer Python to Lua, and TensorFlow's Python bindings are better to use.
In the new pipeline, mobile clients upload photographed document scans to our asynchronous queue. When the upload completes, the image is sent through an RPC call to a cluster of servers running the OCR service.
The OCR service uses OpenCV and TensorFlow, both written in C++ with complex dependencies, so security is a consideration. We isolate the OCR service using technologies such as LXC, CGroups, Linux Namespaces, and Seccomp, with IPC used to communicate in and out of the containers.
This architecture lets us set things up quickly at startup, for example loading the trained models once, and then copy those resources for each individual OCR request to use.
Once we have the character coordinates and the recognized text, we merge them into a PDF. The PDF the user receives contains both the original image and the recognized text. The recognized text is also added to Dropbox's search index. Users can highlight and copy-paste text in the PDF, and they can also search its contents.
At this point we had a system that genuinely ran as an engineered pipeline, but there were still performance issues.
The first question was whether to use GPUs at inference time. Training takes far longer than inference, and using GPUs during training is common, but whether to use GPUs for inference is a much harder call.
For one thing, production systems at companies like Dropbox do not yet use GPUs at scale, and GPU machines are also more expensive. We benchmarked the Word Detector and Word Deep Net on CPUs versus GPUs and found that even running on CPUs, performance was not far off GPU machines, at lower cost.
Having decided on CPUs, we needed to optimize the system. The Word Detector was a major bottleneck, and we ended up reworking OpenCV's C++ MSER implementation.
With all of that done, the system could support a small group of users. The next stage was further refinement.
With the new system running online alongside the original commercial OCR system, we wanted to know whether the new system really performed better on real user data. At Dropbox we take user privacy very seriously and could not simply pull user data for testing, so we again used the user donation flow described earlier to obtain data. We then used those images for an end-to-end evaluation and found that our system was indeed as good as or better than the previous commercial OCR SDK, so we expanded the new system to serve 100% of users.
Next, we tested whether fine-tuning the model on real donated documents worked better than our hand-picked fine-tuning set; the two were about the same.
Another improvement was orientation detection, which the original system did not have. Photos taken on mobile devices may be rotated by 90 or 180 degrees. We built an orientation predictor based on the Inception ResNet v2 network, replacing the final layer to predict orientation, and integrated it into the OCR pipeline.
Only a small fraction of images are actually rotated, so we had to make sure we did not mistakenly rotate the majority of images that were not. In addition, we had to solve various tricky issues in combining our upright rotated images with the different ways the PDF file format can apply its own transformation matrices for rotation.
Finally, we found that the PDFs we generated with the OCRed text layer had bugs in Apple's built-in Preview application. For copy and paste, most PDF renderers take spaces into account, but Apple's Preview uses its own logic to determine character positions, so copying text from the PDF resulted in all the words being stuck together. We had to test every PDF renderer we could and fix any problems we found.
Overall, the whole process of research, productionization, and optimization took eight months, at the end of which we had a state-of-the-art OCR system.
Use the Dropbox document scanner to save, organize, and share your work, right from your phone.
Say goodbye to paperwork
With the Dropbox mobile app, easily upload and organize scans of whiteboards, receipts, sketches, and more. Now available on iOS and Android devices.
Thanks to optical character recognition (OCR), Dropbox Business users can even search for text inside their scans.
Here are some ways to get the most out of the document scanner
No more collecting piles of travel receipts for expense reports, or worrying about losing them. Wherever you are, scan receipts into Dropbox and easily find and submit them later.
No more frantically copying down the whiteboard at the end of a meeting. Just use the document scanner to upload a PDF or PNG of the whiteboard to the folder of your choice and share it with the other attendees at any time.
Ditch the filing cabinet and keep digital copies of important documents. You can keep them neatly organized in Dropbox and access them from any device.
Scan old photos into Dropbox to keep them safe and easy to revisit anytime. All you need are the photos; no special equipment required.
Never lose an important contact again! Business cards collected at meetings and trade shows can be uploaded to Dropbox straight from your phone.
Inspiration can strike anywhere, maybe while flipping through a magazine in the doctor's office. Instead of tearing out someone else's pages, just scan the page or excerpt into Dropbox for later reference.
Want to keep thoughtful personal notes without the clutter? Just scan them into Dropbox for a tidy, organized set of memos.
Scan class or seminar materials for easy reference later and to save space at home or in the office.
Has inspiration come knocking? With the Dropbox document scanner, you can capture your next great business idea anytime.
@wanghaisheng i would really like to integrate with my own application. Can you please help me with the process?
@rohitgarg29 sorry i am not so code guy
https://github.com/pannous/tensorflow-ocr
Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning doc scanner in dropbox
Fast and Accurate Document Detection for Scanning Fast Document Rectification and Enhancement
Improving the Responsiveness of the Document Detector Creating a Modern OCR Pipeline Using Computer Vision and Deep Learning