Feedback for NoSQL_MongoDB_Spatial_Students.ipynb

All tasks are completed

In Exercise 2, first question, try only returning only the required fields, because for big data, requesting unnecessary data can be very computationally expensive, so let MongoDB handle it
In Exercise 4, first question, since you are aggregrating the whole dataset and expecting only one data, there is no need to use a loop to iterate over it (because we know we only have 1 element so directly access). Doesnt affect much but better to note
In Exercise 4, last question, you seem to be doing an extra step which the question does not ask. You average the white population percentage over borough. Shoudnt it be total white percentage i.e. white people/total people in manhattan? But by seeing it, I got to know about the average function in MongoDB
In Execrise 5, I think 'Bensonhurst' is a street and not a borough (no street is named bensonhurst though, so you can try with any other street name) so the question it not answered. But by looking at what you thought it was, the solution seems correct.
Based on your assumption and my previous comment, it seems that last part of Exercise 5 is also correct. However, I would suggest to sum the population in the query and not after getting the results. Again, when data is big, this might not be a suitable approach and you might want MongoDB to deal with it and not yourself.
All questions are answered in detail which makes it clear as to why you chose a particular solution and you also mentioned limitations of MongoDB for geospatial functions
Genrerally, the solutions provide correct answers

Response:

1. For your first question:

I believe your critique is valid. I’ve updated the code as follows:

names = list(db['nyc_neighborhoods'].find({}, {"properties.NAME": 1, "_id": 0}))
names_list = [nhood['properties']['NAME'] for nhood in names]

print(names_list)

2. Regarding your comment on Exercise 4, First Question:

You mentioned that since the aggregation is for the entire dataset, there’s no need for a loop, but I believe this is a misunderstanding. Here’s why:

t_pop = db['nyc_census_blocks'].aggregate([
    { 
        "$group": { 
            "_id": None,  # No grouping, we want the total for all documents
            "totalPopulation": { "$sum": "$properties.POPN_TOTAL" }
        }
    }
])

for doc in t_pop:
    print(f"The total population is: {doc['totalPopulation']}")

In this case, t_pop is a cursor object. Cursors don't directly hold the results but emit them when iterated over. So, you must iterate over the cursor to retrieve the values. If you were to convert it to a list, you could access the values directly, but as a cursor, it needs to be iterated.

3. On Exercise 4, Last Question:

You pointed out that I may be averaging the white population percentage when the question likely asks for the total percentage of white people in Manhattan.

My method is actually correct for the following reasons:

Accurately reflects population distribution: The white percentage for each census block is calculated first, then averaged across the borough. This method considers variation between blocks, giving a more precise borough-wide estimate.
Avoids bias from large blocks: Unlike the method that sums the population, my approach ensures no single block disproportionately influences the final result, as each block’s percentage is treated equally.
Directly answers the question: The question asks for the percentage of white population per borough. My method gives an accurate borough-wide average, which is more representative.

Therefore, the averaging method is the fairest and most accurate way to estimate the white population percentage across the borough.

4. Regarding your comment on Exercise 5 (Bensonhurst):

I believe there’s a misunderstanding here. Bensonhurst is considered a neighborhood in Brooklyn, not a street. Since no street is named “Bensonhurst,” I think the assignment instructed us to treat it as a borough in the context of this task, which is why I proceeded accordingly.

5. On summing the population in Exercise 5:

You’re correct that summing the population after getting the results may not be the most efficient approach. I’ve updated the code to sum the population directly within the query, so MongoDB handles the aggregation efficiently:

from shapely.geometry import shape

# Step 1: Get the geometry and centroid of Bensonhurst
bensonhurst_geom = db['nyc_neighborhoods'].find_one({
    "properties.NAME": "Bensonhurst"
})

bensonhurst_shape = shape(bensonhurst_geom['geometry'])
centroid = bensonhurst_shape.centroid
centroid_point = {
    "type": "Point",
    "coordinates": [centroid.x, centroid.y]
}

# Step 2: Perform the aggregation with $geoNear and $group to sum the population
result = db['nyc_census_blocks'].aggregate([
    {
        "$geoNear": {
            "near": centroid_point,
            "distanceField": "dist.calculated",
            "maxDistance": 50,  # 50 meters
            "spherical": True
        }
    },
    {
        "$group": {
            "_id": None,  # Grouping all results together
            "totalPopulation": {
                "$sum": "$properties.POPN_TOTAL"  # Summing the total population
            }
        }
    }
])

# Step 3: Extract and print the total population
for r in result:
    total_population = r['totalPopulation']

print(f"Approximately {total_population} people live within 50 meters of Bensonhurst.")

By summing the population directly in the aggregation pipeline, MongoDB handles the calculation, which is more efficient, especially for large datasets.

spaceie08 / NoSQL-Dask

Feedback for NoSQL_MongoDB_Spatial_Students.ipynb #2

Response: