spaceie08 / NoSQL-Dask

The repository contains the two notebooks for the mini project II for the course of Big Geo data processing.
0 stars 0 forks source link

Feedback for NoSQL_MongoDB_Spatial_Students.ipynb #2

Open taimoorh13 opened 2 weeks ago

taimoorh13 commented 2 weeks ago

All tasks are completed

spaceie08 commented 2 weeks ago

Response:

1. For your first question:

I believe your critique is valid. I’ve updated the code as follows:

names = list(db['nyc_neighborhoods'].find({}, {"properties.NAME": 1, "_id": 0}))
names_list = [nhood['properties']['NAME'] for nhood in names]

print(names_list)

2. Regarding your comment on Exercise 4, First Question:

You mentioned that since the aggregation is for the entire dataset, there’s no need for a loop, but I believe this is a misunderstanding. Here’s why:

t_pop = db['nyc_census_blocks'].aggregate([
    { 
        "$group": { 
            "_id": None,  # No grouping, we want the total for all documents
            "totalPopulation": { "$sum": "$properties.POPN_TOTAL" }
        }
    }
])

for doc in t_pop:
    print(f"The total population is: {doc['totalPopulation']}")

In this case, t_pop is a cursor object. Cursors don't directly hold the results but emit them when iterated over. So, you must iterate over the cursor to retrieve the values. If you were to convert it to a list, you could access the values directly, but as a cursor, it needs to be iterated.


3. On Exercise 4, Last Question:

You pointed out that I may be averaging the white population percentage when the question likely asks for the total percentage of white people in Manhattan.

My method is actually correct for the following reasons:

Therefore, the averaging method is the fairest and most accurate way to estimate the white population percentage across the borough.


4. Regarding your comment on Exercise 5 (Bensonhurst):

I believe there’s a misunderstanding here. Bensonhurst is considered a neighborhood in Brooklyn, not a street. Since no street is named “Bensonhurst,” I think the assignment instructed us to treat it as a borough in the context of this task, which is why I proceeded accordingly.


5. On summing the population in Exercise 5:

You’re correct that summing the population after getting the results may not be the most efficient approach. I’ve updated the code to sum the population directly within the query, so MongoDB handles the aggregation efficiently:

from shapely.geometry import shape

# Step 1: Get the geometry and centroid of Bensonhurst
bensonhurst_geom = db['nyc_neighborhoods'].find_one({
    "properties.NAME": "Bensonhurst"
})

bensonhurst_shape = shape(bensonhurst_geom['geometry'])
centroid = bensonhurst_shape.centroid
centroid_point = {
    "type": "Point",
    "coordinates": [centroid.x, centroid.y]
}

# Step 2: Perform the aggregation with $geoNear and $group to sum the population
result = db['nyc_census_blocks'].aggregate([
    {
        "$geoNear": {
            "near": centroid_point,
            "distanceField": "dist.calculated",
            "maxDistance": 50,  # 50 meters
            "spherical": True
        }
    },
    {
        "$group": {
            "_id": None,  # Grouping all results together
            "totalPopulation": {
                "$sum": "$properties.POPN_TOTAL"  # Summing the total population
            }
        }
    }
])

# Step 3: Extract and print the total population
for r in result:
    total_population = r['totalPopulation']

print(f"Approximately {total_population} people live within 50 meters of Bensonhurst.")

By summing the population directly in the aggregation pipeline, MongoDB handles the calculation, which is more efficient, especially for large datasets.